1 Introduction

Recent advancements in macroeconomic data collection have led to an increased focus on
high-dimensional time series analysis. A more efficient and precise analysis can thus be
realized if we elicit information appropriately from a large number of explanatory
variables. However, a higher-dimensional model does not necessarily yield better performance
in terms of forecasting and parameter estimation; in fact, the performance varies depending
on the dimensionality and on which estimation method is considered. Without appropriate
dimension reduction, performance may be poor owing to accumulated estimation losses from
redundant or unimportant variables. Since the seminal papers on factor-based (diffusion index)
forecasting, such as Stock and Watson (2002), this approach has become a common tool for
forecasting with large datasets. Specifically, Stock and Watson (2012) showed that factor-based
forecasts perform well in comparison with existing forecasting methods, including
autoregressive forecasts, pretest methods, Bayesian model averaging, empirical Bayes, and
bagging. They concluded that it seemed difficult to outperform a factor-based forecast
without introducing nonlinearity and/or time-varying parameters to a forecast model.

In this paper, we tackle the high-dimensional forecasting and estimation problem from
other theoretical and empirical points of view. We employ sparse modeling, which can
allow for ultrahigh dimensionality, where the number of regressors diverges sub-exponentially.
The unknown sparsity can be recovered using a folded-concave penalized regression to
pursue both prediction efficiency and variable selection consistency. In particular, we consider
penalties including the smoothly clipped absolute deviation (SCAD) penalty introduced by
Fan and Li (2001), the minimax concave penalty (MCP) proposed by Zhang (2010), as well
as the $\ell_1$-penalty (Lasso) proposed by Tibshirani (1996). Previous studies on macroeconomic
forecasting using sparse modeling include Bai and Ng (2008), De Mol et al. (2008),
Kock and Callot (2015), Marsilli (2014), and Nicholson et al. (2015), but their
estimation strategies are basically limited to the $\ell_1$-penalty. Although the $\ell_1$-penalty is
theoretically expected to perform as well as the SCAD and MCP, as we see in a later section, it is
often insufficient in terms of model selection consistency, whereas the SCAD and MCP can
have this desirable property. Moreover, it is difficult to find a statistical theory of penalized
regression estimators in a time series context.

In the first half of this paper, we provide comprehensive theoretical properties of the
penalized regression estimator under conditions suitable for macroeconometrics, from the
perspective of both prediction efficiency and variable selection consistency. The
theoretical aspects have been explored in many recent works in statistics, including Bühlmann
and van de Geer (2011), Fan and Lv (2011), Fan and Lv (2013), and Loh and Wainwright
(2014), as well as the references therein. However, the results of these studies are not
sufficient for time series econometrics. In this paper, we derive a non-asymptotic upper bound
for the prediction loss, called the oracle inequality. This ensures that the forecast value is
reliable and that it is an optimal forecast in the asymptotic sense. Likewise, we also show the
estimation precision of the regression coefficients and the model selection consistency, known
as the oracle property; that is, the estimator selects the correct subset of predictors and
estimates the non-zero coefficients as efficiently as would be possible if we knew which
variables were irrelevant. The oracle property provides another insight into the modeling of the
variable of interest. In this regard, models can be selected by information criteria, such as the
AIC and BIC. These have become popular owing to their tractability; however, they are limited
when dealing with high-dimensional models because they demand an exhaustive search over all
submodels. In contrast, the SCAD-type penalized regression yields simultaneous estimation
and model selection, even in the ultrahigh-dimensional case.

In the second half of the paper, we shed light on the validity of the penalized regression
in macroeconometrics by introducing two empirical applications. The first one focuses on
the oracle inequality. We forecast quarterly U.S. real GDP with a large number of
monthly predictors using the MIDAS (Mixed DAta Sampling) regression framework originally
proposed by Ghysels et al. (2007). Since the total number of parameters is much larger than
the number of observations, this situation should be treated as an ultrahigh-dimensional problem.
In contrast to the original MIDAS model of Ghysels et al. (2007), the penalized regression
enables us to forecast quarterly GDP using a large number of monthly predictors without
imposing a distributed lag structure on the regression coefficients. We find that the
forecasting performance of the penalized regression is better than that of the factor-based MIDAS
(F-MIDAS) regression proposed by Marcellino and Schumacher (2010) and is competitive
with the nowcasting model based on the state-space representation in real-time forecasting.
The second application concentrates on the oracle property. We investigate how well the
penalized regression can screen a (hidden) fund manager's portfolio from large-dimensional
NYSE stock price data. We construct artificial portfolios and confirm that the penalized
regression using the SCAD-type penalty effectively detects the relevant stocks that should
be contained in the portfolio. These two empirical applications motivate us to
apply the penalized regression broadly to macroeconomic time series.

The remainder of the paper is organized as follows. Section 2 specifies an
ultrahigh-dimensional time series regression model and the estimation scheme. The statistical validity
of the method is confirmed in Section 3 by deriving the oracle inequality and the oracle
property. Section 4 illustrates how we can apply the penalized regression to macroeconomic
time series through two empirical analyses. Section 5 concludes. The proofs and miscellaneous
results are collected in the Appendix.
2 Regression Model


The regression model to be considered is

$$y = X\beta_0 + u, \tag{1}$$

where $y = (y_1, \dots, y_T)^\top$ is a response vector, $X = (x_1, \dots, x_T)^\top$ is a covariate matrix with
$x_t = (x_{t1}, \dots, x_{tp})^\top$, $u = (u_1, \dots, u_T)^\top$ is an error vector, and $\beta_0 = (\beta_{0A}^\top, \beta_{0B}^\top)^\top$ is a
$p$-dimensional unknown sparse parameter vector with $\beta_{0A} = (\beta_{0,1}, \dots, \beta_{0,s})^\top$ an $s$-dimensional
vector of nonzero elements and $\beta_{0B} = 0$. We also denote the $j$th column vector of $X$ by
$X_j = (x_{1j}, \dots, x_{Tj})^\top$. Further, we write $X = (X_A, X_B)$ corresponding to the
decomposition of the parameter vector. Throughout the paper, we assume that for each $i$, $\{x_{ti}u_t\}_t$ is a
martingale difference sequence with respect to an appropriate filtration.

The objective of the paper is to construct an efficient $h$-step-ahead forecast of
$y_{T+h}$ and to select variables consistently when the dimension $p$ is much larger than $T$. In
such cases, $X$ may contain many irrelevant columns, so that the sparsity assumption on $\beta_0$
may be appropriate. In this paper, we consider an ultrahigh-dimensional case, meaning that
$p$ diverges sub-exponentially (non-polynomially). At the same time, the degree of sparsity
$s$ may also diverge, but $s < T$ must be satisfied. The estimation procedure should select
a relevant model as well as consistently estimate the parameter vector. The estimator is
defined as a minimizer of the objective function

$$Q_T(\beta) = (2T)^{-1}\|y - X\beta\|_2^2 + \|p_\lambda(\beta)\|_1 \tag{2}$$

over $\beta \in \mathbb{R}^p$, where $p_\lambda(\beta) = (p_\lambda(|\beta_1|), \dots, p_\lambda(|\beta_p|))^\top$ and $p_\lambda(v)$, for $v \ge 0$, is a penalty
function indexed by a regularization parameter $\lambda (= \lambda_T) > 0$. The penalty function $p_\lambda$
takes forms such as the $\ell_1$-penalty (Lasso) by Tibshirani (1996), the SCAD penalty by Fan and
Li (2001), and the MCP by Zhang (2010). These penalties belong to a family of so-called
folded-concave penalties because of their functional forms. The statistical properties have
been developed for models with a deterministic covariate and i.i.d. Gaussian errors in the
literature on high-dimensional statistics. We thoroughly investigate these properties, while
relaxing the assumptions sufficiently to include many time series models.

Figure 1: Shape of folded-concave penalties: MCP and Lasso.

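We introduce the three penalties to be used. Let $v$ denote a positive variable. The
$\ell_1$-penalty is given by $p_\lambda(v) = \lambda v$, and we then obtain $p'_\lambda(v) = \lambda$ and $p''_\lambda(v) = 0$. The SCAD
penalty is defined by

$$p_\lambda(v) = \lambda v \, 1\{v \le \lambda\} + \frac{a\lambda v - 0.5(v^2 + \lambda^2)}{a - 1} \, 1\{\lambda < v \le a\lambda\} + \frac{\lambda^2(a^2 - 1)}{2(a - 1)} \, 1\{v > a\lambda\}.$$

Its derivative is

$$p'_\lambda(v) = \lambda \, 1\{v \le \lambda\} + \frac{(a\lambda - v)_+}{a - 1} \, 1\{v > \lambda\}$$

for some $a > 2$. Then we have $p''_\lambda(v) = -(a - 1)^{-1} 1\{v \in (\lambda, a\lambda)\}$. The MCP is defined by

$$p_\lambda(v) = \left(\lambda v - \frac{v^2}{2a}\right) 1\{v \le a\lambda\} + \frac{1}{2} a\lambda^2 \, 1\{v > a\lambda\}.$$

Its derivative is $p'_\lambda(v) = a^{-1}(a\lambda - v)_+$ for some $a > 1$. Thus, we have $p''_\lambda(v) = -a^{-1} 1\{v < a\lambda\}$.
Figure 1 illustrates the shape of the MCP for several values of the tuning parameter $a$, as well
as that of the Lasso.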
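
As a minimal sketch of the three penalties of this section, the following Python functions evaluate the Lasso, SCAD, and MCP penalties and the SCAD and MCP derivatives for a scalar $v \ge 0$. The function names and the default values of $a$ (3.7 for the SCAD, 3 for the MCP) are illustrative assumptions and are not taken from the paper.

```python
def lasso_penalty(v, lam):
    """l1-penalty: p_lam(v) = lam * v for v >= 0."""
    return lam * v


def scad_penalty(v, lam, a=3.7):
    """SCAD penalty for v >= 0; requires a > 2 (default a is an assumption)."""
    if v <= lam:
        return lam * v
    if v <= a * lam:
        return (a * lam * v - 0.5 * (v * v + lam * lam)) / (a - 1.0)
    return lam * lam * (a * a - 1.0) / (2.0 * (a - 1.0))


def scad_derivative(v, lam, a=3.7):
    """Derivative of the SCAD penalty: lam on [0, lam], then (a*lam - v)_+ / (a - 1)."""
    if v <= lam:
        return lam
    return max(a * lam - v, 0.0) / (a - 1.0)


def mcp_penalty(v, lam, a=3.0):
    """MCP for v >= 0; requires a > 1 (default a is an assumption)."""
    if v <= a * lam:
        return lam * v - v * v / (2.0 * a)
    return 0.5 * a * lam * lam


def mcp_derivative(v, lam, a=3.0):
    """Derivative of the MCP: (a*lam - v)_+ / a."""
    return max(a * lam - v, 0.0) / a


if __name__ == "__main__":
    lam = 1.0
    for v in (0.5, 2.0, 10.0):
        print(v, lasso_penalty(v, lam), scad_penalty(v, lam), mcp_penalty(v, lam))
```

Both folded-concave penalties coincide with the Lasso near zero (their derivative at $0+$ is $\lambda$) and become flat for large $v$, which is what removes the shrinkage bias on large coefficients.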
3 Two Theoretical Results


In this section, we establish two important theoretical results for time series models: the
oracle inequality and the oracle property. The oracle inequality gives optimal non-asymptotic
error bounds for estimation and prediction in the sense that the error bounds are of the same
order of magnitude, up to a logarithmic factor, as those we would have if we knew the relevant
variables a priori (Bühlmann and van de Geer, 2011). This result strongly supports the
use of penalized regressions in terms of forecasting accuracy, even in ultrahigh-dimensional
spaces. Note, however, that the inequality provides no information on model
selection consistency; that is, it is not clear whether the penalized regression correctly
distinguishes the relevant variables contained in the true model from the irrelevant ones. This
issue is addressed by the oracle property, which states that the estimator
exhibits model selection consistency. The existing results have shown the oracle inequality
and the oracle property under i.i.d. Gaussian errors and deterministic covariates; in this
paper, we extend these results to time series models.

Assumption 1 We have $\log p = O(T^{\delta})$ and $s = O(T^{\delta_0})$ for some constants $\delta, \delta_0 \in (0, 1)$.

Assumption 2 Penalty function $p_\lambda(\cdot)$ is characterized as follows:

(a) $p_\lambda(v)$ is concave in $v \in [0, \infty)$ with $p_\lambda(0) = 0$,

(b) $p_\lambda(v)$ is nondecreasing, but $v \mapsto p_\lambda(v)/v$ $(v \ne 0)$ is nonincreasing in $v \in [0, \infty)$,

(c) $p_\lambda(v)$ has a continuous derivative $p'_\lambda(v)$ for $v \in (0, \infty)$ with $p'_\lambda(0+) = \lambda$,

(d) There exists $\mu > 0$ such that $p_\lambda(v) + \mu v^2/2$ is convex in $v \in [0, \infty)$.

Assumption 1 means that the dimensionality of the model, $p$, diverges sub-exponentially
as $T$ goes to infinity. Assumption 2 determines a family of folded-concave penalties that
bridges $\ell_0$- and $\ell_1$-penalties. The SCAD and MCP are included in this family. The
$\ell_1$-penalty also satisfies this as the boundary of this class. It is known from Lemmas 6 and 7 of
Loh and Wainwright (2014) that (d) is true provided that $\mu \ge 1/(a - 1)$ for the SCAD and
$\mu \ge 1/a$ for the MCP.

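We define the gradient vector and Hessian matrix of $(2T)^{-1}\|y - X\beta\|_2^2$ as
$G_T(\beta) = -X^\top(y - X\beta)/T$ and $H_T = X^\top X/T$, respectively. Denoting $G_{0T} = G_T(\beta_0)$, we may write

$$
G_{0T} = \begin{pmatrix} G_{0AT} \\ G_{0BT} \end{pmatrix}
= -\frac{1}{T}\begin{pmatrix} X_A^\top u \\ X_B^\top u \end{pmatrix},
\qquad
H_T = \frac{1}{T}\begin{pmatrix} X_A^\top X_A & X_A^\top X_B \\ X_B^\top X_A & X_B^\top X_B \end{pmatrix}
= \begin{pmatrix} H_{AAT} & H_{ABT} \\ H_{BAT} & H_{BBT} \end{pmatrix}.
$$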


3.1 Oracle inequality 

We derive optimal non-asymptotic error bounds for estimation and prediction, called the
oracle inequality. In the literature, Bühlmann and van de Geer (2011, Ch. 6) presented a complete
guide to the inequality using the $\ell_1$-penalty with fixed predictors and i.i.d. normal errors.
We extend the result in two ways. First, the inequality holds for the general model (1).
Second, we prove the asymptotic equivalence of the $\ell_1$-penalty and the other folded-concave penalties
characterized by Assumption 2 in the sense that they satisfy the same rate. This indicates that
the forecasting performance is asymptotically equivalent, irrespective of the folded-concave
penalty used. We first derive the bounds under two high-level assumptions in Section 3.1.1.
We next consider the conditions under which the two high-level assumptions are actually
verified in a reasonable time series setting in Section 3.1.2. Related studies are introduced
in Appendix A.9.

3.1.1 General result 

We start with general but high-level assumptions: 

Assumption 3 There are a sequence $\lambda = o(1)$ and a positive constant $c_1$ such that the
complement of the event $\mathcal{E}_1 = \{\|G_{0T}\|_\infty \le \lambda/2\}$ satisfies $P(\mathcal{E}_1^c) = O(p^{-c_1})$.

Assumption 4 There are a diverging sequence $m = o(T)$ and positive constants $c_2$ and
$\gamma > \mu/2$ such that the complement of the event
$\mathcal{E}_2 = \{\min_{v \in \mathbb{R}^p,\, \|v\|_0 \le m} T^{-1}\|Xv\|_2^2/\|v\|_2^2 \ge \gamma\}$
satisfies $P(\mathcal{E}_2^c) = O(\exp(-c_2 T))$.


Assumption 3 requires the gradient vector $G_{0T}$ to fluctuate little and to converge to
zero at an appropriate rate determined by $\lambda$. For example, we should set $\lambda = O((\log p/T)^{1/2})$
for the case when $u$ is i.i.d. normal and $X$ is deterministic. Assumption 4 is a stochastic
version of the restricted strong convexity studied by Negahban et al. (2012). This prevents the
minimum eigenvalue of the sub-matrices of the Hessian matrix $H_T$ from being too small. These
two assumptions fully control the randomness of the regression model, meaning that Theorem 1
below holds as long as they are satisfied, irrespective of the dependence structure the model
possesses. The problem is what reasonable conditions on $X$ and $u$ satisfy Assumptions
3 and 4. In fact, these can easily be verified for i.i.d. Gaussian $u$ and deterministic $X$.
However, it becomes quite unclear whether these assumptions hold
once the model departs from such simple settings.
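
To make the event in Assumption 3 concrete in the simple benchmark case, the following Python sketch estimates by simulation how often the sup-norm of $X^\top u/T$ exceeds $\lambda/2$ when $X$ is held fixed and $u$ is i.i.d. standard normal. The function names, the constant `c = 4` in $\lambda = c(\log p/T)^{1/2}$, the dimensions, and the seed are all illustrative assumptions, and the design is drawn once and rescaled so that each column satisfies $\sum_t x_{tj}^2 = T$.

```python
import math
import random


def gradient_sup_norm(X, u):
    """Sup-norm of X'u / T (the sign of the gradient does not affect the norm)."""
    T, p = len(X), len(X[0])
    return max(abs(sum(X[t][j] * u[t] for t in range(T))) / T for j in range(p))


def violation_frequency(T=50, p=40, c=4.0, reps=200, seed=0):
    """Share of replications in which ||X'u/T||_inf > lambda/2 for fixed X, Gaussian u."""
    rng = random.Random(seed)
    # Draw the design once and treat it as deterministic afterwards.
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(T)]
    for j in range(p):
        scale = math.sqrt(sum(X[t][j] ** 2 for t in range(T)) / T)
        for t in range(T):
            X[t][j] /= scale  # now sum_t x_tj^2 = T
    lam = c * math.sqrt(math.log(p) / T)
    violations = 0
    for _ in range(reps):
        u = [rng.gauss(0.0, 1.0) for _ in range(T)]
        if gradient_sup_norm(X, u) > lam / 2.0:
            violations += 1
    return violations / reps


if __name__ == "__main__":
    print("empirical violation frequency:", violation_frequency())
```

In this setting each coordinate of $X^\top u/T$ is $N(0, 1/T)$, so a union bound with the Gaussian tail inequality gives a violation probability of at most $2p^{1 - c^2/8}$, which is $2/p$ for the assumed $c = 4$.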

Under the assumptions listed above, we can derive the following result: 


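Theorem 1 Let Assumptions 1-4 hold. Then, there exists a local minimizer $\hat\beta$ of $Q_T(\beta)$ on
$\{\beta \in \mathbb{R}^p : \|\beta\|_0 \le m - s\}$ such that, with probability at least $1 - O(p^{-c_1}) - O(\exp(-c_2 T))$, the
following hold:

(a) (Estimation error in $\ell_2$-norm) $\|\hat\beta - \beta_0\|_2 \le \dfrac{6 s^{1/2}\lambda}{2\gamma - \mu}$;

(b) (Estimation error in $\ell_1$-norm) $\|\hat\beta - \beta_0\|_1 \le \dfrac{24 s\lambda}{2\gamma - \mu}$;

(c) (Prediction loss) $T^{-1/2}\|X(\hat\beta - \beta_0)\|_2 \le \dfrac{9 s^{1/2}\lambda}{(2\gamma - \mu)^{1/2}}$.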


If $2\gamma - \mu$ is assumed to be fixed, the error bounds converge to zero as long as $\lambda$ goes to
zero faster than $s^{1/2}$ or $s$ diverges. In a simple setting with i.i.d. Gaussian $u_t$ and fixed $x_t$,
it is known that $\lambda$ should be given by $O((\log p/T)^{1/2})$ as mentioned before, leading to the
explicit convergence rate $O((s \log p/T)^{1/2})$ for bounds (a) and (c). This goes to zero provided that
$\delta + \delta_0 < 1$. We observe later that the rates become slightly slower in a time series setting. Result (c)
exhibits an optimal bound for the prediction loss in the $\ell_2$-norm in the sense of Bickel et
al. (2009). This result justifies using any penalty function specified by Assumption 2 when
the aim is forecasting in the ultrahigh dimension. To understand the result, we consider a
simplification in model (1) such that $X$ is deterministic, $u$ is i.i.d. with a unit variance, and
$s = p < T$. Then, the squared risk of the OLS estimator $\hat\beta^{OLS}$ becomes

$$T^{-1}E\|X(\hat\beta^{OLS} - \beta_0)\|_2^2 = T^{-1}E[u^\top X(X^\top X)^{-1}X^\top u] = T^{-1}\operatorname{tr} I_s = s/T.$$

Consider the case $p > T > s$. If we knew the true model $A$, we could choose the correct
$s$ variables from $X$, leading to the risk $s/T$. However, since $A$ is unknown, an additional
logarithmic factor is inserted, which is regarded as the price to pay for not knowing $A$.

3.1.2 When does the general result hold? 

Theorem 1 has established the non-asymptotic error bounds for the penalized regression
estimators and prediction error under general, yet high-level, assumptions. Specifically,
Assumptions 3 and 4 should be verified for each model we attempt to employ. Here we
consider a specific time series model with dependence. To this end, we first
strengthen the assumption on dimensionality:

Assumption 5 The dimensionality is given by log p = (f)T^ and s = for some positive 
constants 0, d, and do such that d -l- do < 1- 

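In order to specify the processes of $X$ and $u$, we further assume, in the same manner as
Ahn and Horenstein (2013), that the covariate $X$ and the error $u$ are given by

$$X = R_x^{1/2} Z_x \Sigma_x^{1/2}, \qquad u = \sigma_u R_u^{1/2} z_u, \tag{3}$$

where the random matrix $Z_x \in \mathbb{R}^{T \times p}$, random vector $z_u \in \mathbb{R}^{T}$, and deterministic matrices
$R_x \in \mathbb{R}^{T \times T}$, $R_u \in \mathbb{R}^{T \times T}$, and $\Sigma_x \in \mathbb{R}^{p \times p}$ are characterized by the following assumption: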
Assumption 6 The following conditions hold:


(a) The entries of $Z_x$ and $z_u$ are i.i.d. standard normal random variables.

(b) $R_x$, $R_u$, and $\Sigma_x$ are symmetric and positive definite non-random matrices, the
minimum eigenvalues of which are bounded from below by positive constants $c_{R_x}$, $c_{R_u}$,
and $c_{\Sigma}$, respectively. In addition, we set $c_R = c_{R_x} \wedge c_{R_u}$ and $\sigma_u > 0$.

(c) R]I^ = and j are lower triangular matrices whose elements satisfy 

r® = rlf = 1 and V = 0(1) for all 5, where and = 

^x^ = is a positive definite matrix that satisfies cr® = 1 and 

2® < oo for all j, where S® = 2f=i(o-;f )^- 

Gaussianity in condition (a) can be weakened to sub-Gaussianity. Matrices in condition 
(c) are defined based on the Cholesky decomposition and Spectral decomposition under 
condition (b). Model (3) with Assumption 6 covers a wide range of time series processes 
with cross-sectional dependences. A simple example of R]I^ and is given by setting 
r® 1 = Or and cr^^^j = for some constants Qy and (pa- satisfying 10^1 < 1 and \ipa\ < oo with 
other entries all zero. Obviously, this formulation satisfies condition (c) with reducing model 
(3) to an MA(1) process. Other weak stationary processes with cross-sectional dependences 
can be expressed in a similar manner. 

Proposition 1 Let Assumptions 5 and 6 hold with A = cq log(/?r)(log with choosing 

positive constant Cq such that cq > \6cxu, where c^u = hm sup^ max, ,{/?®Lp,-,7?|yVu} < oo. 
Then, Assumption 3 is satisfied with < 6p~^. 

Proposition 2 Let Assumptions 5 and 6 hold with m < and (p^ < \ /2. Then, Assump¬ 
tion 4 is satisfied with y = cgCrI 9 and PiS^) < 2 exp(-C 2 T), where C 2 = 1/2 - (p^. 

Combining Propositions 1 and 2 leads to the non-asymptotic bounds in the time series 
setting specified by Assumptions 5 and 6. 
Corollary 1 Let Assumptions 2, 5, and 6 hold with the constants being the same as in
Propositions 1 and 2. Then, there exists a local minimizer fi of Qjifi) on{p & W : ||jS||o < 
m - s) such that, with probability at least 1 - 6p~^ - 2exp{-(l/2 - (p^)T}, the the error 
bounds (a)-(c) of Theorem 1 hold. 

Corollary 1 does not always imply consistency. Once the condition $\delta + \delta_0 < 1$ in
Assumption 5 is strengthened to $3\delta + \delta_0 < 1$, the bounds of (a) and (c), given by
$s^{1/2}\lambda = O(T^{\delta_0/2}\log(pT)(\log p/T)^{1/2})$, converge to zero. Similarly, adding the condition
$3\delta + 2\delta_0 < 1$ entails that the bound of (b) converges to zero.
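
The exponent arithmetic behind these two conditions can be spelled out as follows. This is a sketch under the reading that $\log p$ grows like $T^{\delta}$ and $s$ like $T^{\delta_0}$, so that $\log(pT) = \log p + \log T$ is also of order $T^{\delta}$, and that $\lambda$ is proportional to $\log(pT)(\log p/T)^{1/2}$ as in Proposition 1; constants are ignored throughout.

```latex
\begin{align*}
\lambda &\asymp \log(pT)\,(\log p/T)^{1/2}
         \asymp T^{\delta}\cdot T^{(\delta-1)/2} = T^{(3\delta-1)/2},\\
s^{1/2}\lambda &\asymp T^{\delta_0/2}\cdot T^{(3\delta-1)/2}
         = T^{(3\delta+\delta_0-1)/2} \to 0
         \iff 3\delta+\delta_0<1,\\
s\lambda &\asymp T^{\delta_0}\cdot T^{(3\delta-1)/2}
         = T^{(3\delta+2\delta_0-1)/2} \to 0
         \iff 3\delta+2\delta_0<1.
\end{align*}
```

The second line corresponds to bounds (a) and (c) of Theorem 1 and the third to bound (b), which carries the factor $s$ rather than $s^{1/2}$.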

Compared to the conventional rate, $O((s \log p/T)^{1/2})$, obtained with i.i.d. normal errors
and fixed covariates, a slightly slower rate $O(\log(pT)(s \log p/T)^{1/2})$ arises for our time
series model. We can interpret the additional factor $\log(pT)$ as an extra cost of departure
from the independent Gaussian world. To understand this, the point is the behavior of the
process $\{x_{ti}u_t\}$ for each $i$. If $u_t$ is i.i.d. Gaussian and $x_{ti}$ is deterministic, $\{x_{ti}u_t\}$ becomes
a sequence of independent normal random variables. Hence, it is easy to control the tail
probability $P(\|G_{0T}\|_\infty > \lambda)$ to be very small by using the inequality $P(|Z| > x) \le \exp(-x^2/2)$
for $Z$ from $N(0, 1)$ and for any $x > 0$. Contrary to this conventional setting, ours assumes
$x_t$ is stochastic, so that $\{x_{ti}u_t\}$ is no longer an independent Gaussian process. To evaluate the tail
probability, we may use the Azuma-Hoeffding inequality together with the assumption that
$\{x_{ti}u_t\}$ is a martingale difference sequence. In this case, we have to control the boundedness
of $\{x_{ti}u_t\}$ at the same time, resulting in the additional factor $\log(pT)$ described above.

3.2 Oracle property 

It is well known that the capacity of the Lasso for model selection is quite limited (e.g., Fan
and Lv 2011). If we employ a SCAD-type penalty, however, a stronger and more desirable
result on variable selection can be obtained. This result is called the oracle property, as
studied first by Fan and Li (2001). Under this property, the estimator is asymptotically equivalent
to the maximum likelihood estimate obtained under the correct restriction $\beta_B = 0$. To derive
it under a time series setting, we need a different set of conditions; see Appendix A.1. Define
$d (= d_T) = \min_{j \in A} |\beta_{0,j}|/2$, $I_{0AA} = T\,E[G_{0AT}G_{0AT}^\top]$, and $J_{0AA} = E[H_{AAT}]$.

Under the assumptions in Appendix A.1, we will derive model selection consistency and an
appropriate rate of convergence. The role of Assumption 7 is essentially the same as that
of Assumption 3. The first condition in Assumption 8 is a variant of the beta-min condition
in Bühlmann and van de Geer (2011, Ch. 7). This is necessary to distinguish the nonzero
coefficients of relevant variables from zero, though it seems stringent in the case of
econometric modeling. The second condition, $p'_\lambda(d) = 0$, is key to achieving the oracle property. This
is strong enough to exclude the $\ell_1$-penalty from the class characterized by Assumption 2. In fact,
for the $\ell_1$-penalty, $p'_\lambda(v) = \lambda (> 0)$ holds identically for all $v > 0$. On the other hand, for the
SCAD and MCP, the condition holds for a sufficiently large $T$ as long as $d/\lambda \to \infty$ is satisfied.
Assumptions 9-11 seem quite natural and are frequently used in stationary time series analysis.
Assumption 12 restricts the asymptotic behavior of the lower-left $(p - s) \times s$ submatrix of $H_T$. This is
essentially the same as condition (27) of Fan and Lv (2011).

Letting b eW he such that ||^>||2 = 1, we set = b~^ ^QAA^AtUt and These 

can easily be shown to be a martingale difference sequence and martingale difference array, 
respectively. Note that ^Tt can also be written as T^/^b~^I~l^^GoAT- Assumption 13 is 
required to obtain the asymptotic normality. From Davidson (1994, Ch. 24), this leads to a 
central limit theorem of a martingale difference sequence. If is ergodic stationary, this is 
redundant (Billingsley, 1961). 

Theorem 2 (oracle property) Let Assumptions 1, 2, and 7-12 hold. Then, there exists a
local minimizer $\hat\beta = (\hat\beta_A^\top, \hat\beta_B^\top)^\top$ of $Q_T(\beta)$ such that

(a) (Sparsity) $\hat\beta_B = 0$ with probability approaching one;

(b) (Rate of convergence) $\|\hat\beta_A - \beta_{0A}\|_2 = O_p((s/T)^{1/2})$.

In addition, if Assumption 13 holds, then for any $b \in \mathbb{R}^s$ satisfying $\|b\|_2 = 1$, we have

(c) (Asymptotic normality) $T^{1/2} b^\top I_{0AA}^{-1/2} J_{0AA}(\hat\beta_A - \beta_{0A}) \to_d N(0, 1)$.


The oracle property means that the model selection is consistent in the sense of (a) 
and (b). Moreover, as is understood by result (c), the estimator has the same asymptotic 
efficiency as the (infeasible) MLE obtained with advance knowledge of the true submodel. 
Based on these results, we can estimate ultrahigh-dimensional models without irksome tests 
for zero restrictions on the parameters or an exhaustive search using information criteria. 

4 Empirical Examples 

According to the theoretical results given in the previous sections, the penalized regression
can have two desirable properties: the oracle inequality and the oracle property. In this
section, we provide two empirical examples that illustrate how well the penalized regression
works in macroeconometric analyses. The first forecasts quarterly real U.S. GDP with a
large number of monthly macroeconomic predictors, and the second screens a portfolio from
a large number of potential securities using NYSE stock price data.

4.1 Forecasting quarterly U.S. GDP with a large number of predictors 

4.1.1 Penalized MIDAS regression model 

In this section, we illustrate how to apply the penalized regression model to macroeconomic
time series using the MIDAS forecasting regression. The MIDAS regression model was
originally proposed by Ghysels et al. (2007) and is now one of the standard tools for
forecasting with mixed-frequency data, as is the nowcasting model based on the state-space
representation (e.g., Giannone et al., 2008; Bańbura and Modugno, 2013). The original
(or basic) MIDAS regression model has the advantage of describing a forecasting regression
model in a simple and parsimonious way through a distributed lag structure with a few
hyperparameters. However, the original MIDAS regression model would not be suitable for a
situation where the number of predictors in the model is very large. For example, consider
the original MIDAS regression model with $K$ hyperparameters and $N$ macroeconomic time
series. Then, the total number of parameters in the original MIDAS regression model
remains $NK + 1 = O(N)$. Thus, it incurs a serious efficiency loss if $N$ is large, or the
model may even become inestimable. On the other hand, the penalized regression enables us to
estimate the MIDAS regression model without imposing the distributed lag structure on the
regression coefficients. Moreover, Theorem 1 implies that the forecast value obtained by the
penalized regression is reliable. In the following, we link the penalized regression model
(1) to the MIDAS regression model without parameter restrictions, and forecast
quarterly U.S. GDP with monthly macroeconomic data using the penalized regression.

Let $\{y_t, x_{t-j/m}\}$ be the MIDAS process in line with Andreou et al. (2010), where the scalar
$y_t$ is the low-frequency variable observed at $t = 1, \dots, T$, and the $N$-dimensional vectors
$x_{t-j/m}$ $(j = 0, 1, \dots, m - 1)$ collect the higher-frequency variables observed $m$ times between
$t$ and $t - 1$. For example, $m = 3$ if we forecast a quarterly variable with monthly predictors.
We consider the $h$-step-ahead mixed-frequency forecasting regression model with $\ell$ lags,

$$y_t = x_{t-h}^\top \beta_0 + u_t, \qquad t = 1, \dots, T, \tag{4}$$

where $x_{t-h} = (1, x_{2,t-h}^\top, \dots, x_{N,t-h}^\top)^\top$ with $x_{k,t-h} = (x_{k,t-h}, x_{k,t-h-1/m}, \dots, x_{k,t-h-\ell/m})^\top$,
$k = 2, 3, \dots, N$, $\beta_0 = (\beta_{0,1}, \dots, \beta_{0,N\ell+N-\ell})^\top$ is the parameter vector, and $u_t$ is an error term.

Here the case $h < 1$ $(h = 0, 1/m, 2/m, \dots, (m - 1)/m)$ corresponds to a nowcast; we forecast a
low-frequency variable with the "latest" high-frequency variables released between $t - 1$ and
$t$. For instance, if we consider a quarterly/monthly ($m = 3$) case, $h = 0$ $(1/3)$ means that we
forecast a quarterly variable in 2015Q2 with monthly data in June (May) 2015 or earlier. Note
that model (4) has the same structure as (1) with $p := (N - 1)(\ell + 1) + 1 = N\ell + N - \ell$,
but it differs from the original MIDAS regression model by Ghysels et al. (2007); our
model does not employ the distributed lag structure on $x_{t-h}$, while they used
$x_{t-h}(\theta) = (1, \tilde{x}_{2,t-h}(\theta_2), \dots, \tilde{x}_{N,t-h}(\theta_N))^\top$ instead of $x_{t-h}$ such that
$\tilde{x}_{k,t-h}(\theta_k) = \sum_{j} w_{j,k}(\theta_k)\, x_{k,t-h-j/m}$,
$k = 2, \dots, N$, where $w_{j,k}(\theta_k) \in (0, 1)$ and $\sum_{j} w_{j,k}(\theta_k) = 1$. As mentioned above, the
original MIDAS model crucially depends on the restrictive distributed lag structure and cannot
effectively reduce the total number of parameters to be estimated if $N$ is very large.
Alternatively, the MIDAS regression that minimizes the penalized loss (2) can estimate $\beta_0$ and
forecast $y_t$ without the distributed lag structure.
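
To illustrate how the unrestricted regressor vector of model (4) can be assembled, the following Python sketch stacks a constant and $\ell + 1$ monthly lags of each series into one row. It is only a sketch under assumptions: the function names are hypothetical, months and quarters are indexed from zero, the horizon is passed in months (`h_months` equals $m h$), and, following the parameter count stated for model (4), $N - 1$ series enter with lags in addition to the constant.

```python
def n_params(N, lags):
    """Parameter count p = (N - 1)(lags + 1) + 1 stated for model (4)."""
    return (N - 1) * (lags + 1) + 1


def last_month_index(quarter, h_months, m=3):
    """0-based index of the latest month used to forecast the 0-based `quarter`.

    h_months = 0 uses the last month of the target quarter,
    h_months = 1 stops one month earlier, and so on.
    """
    return m * (quarter + 1) - 1 - h_months


def midas_row(monthly, last_month, lags):
    """Build x_{t-h}: a constant followed by lags 0..`lags` of each monthly series.

    `monthly` is a list of series, each a list indexed by month.
    """
    if last_month - lags < 0:
        raise ValueError("not enough monthly history for the requested lags")
    row = [1.0]
    for series in monthly:
        for j in range(lags + 1):
            row.append(series[last_month - j])
    return row


if __name__ == "__main__":
    # Two toy monthly series covering 12 months (4 quarters).
    monthly = [list(range(12)), list(range(100, 112))]
    last = last_month_index(quarter=1, h_months=0)  # months 3, 4, 5 form quarter 1
    print(midas_row(monthly, last, lags=2))  # [1.0, 5, 4, 3, 105, 104, 103]
    print(n_params(117, 8))                  # 1045
```

Stacking such rows over $t$ gives the design matrix $X$ in (1), on which the penalized objective (2) is minimized without any lag-weight restriction.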

From a macroeconomic forecasting point of view, it is natural to consider that there is a
small set of key predictors that contain rich information to forecast $y$, while there are many
redundant predictors. To reduce the accumulation of estimation errors, we should model
$y$ using only the key predictors. Although the redundant predictors may have
"non-zero" forecasting power, the penalized regression sets their coefficient estimates to zero as
an approximation. In other words, we can say that the sparsity assumption claims there exist
"targeted predictors" for $y$ (Bai and Ng, 2008).

Hereafter, we call the MIDAS regression model estimated by the penalized regression the
"penalized MIDAS regression." We also note that, as a method related to our penalized
MIDAS regression, Marsilli (2014) proposes a MIDAS regression model with a penalized
regression. However, he employed the original parsimonious MIDAS parameterization, which
completely differs from our model in terms of parameterization, as we stressed above.

4.1.2 Data 

U.S. quarterly real GDP growth is taken from the FRED database. The sample period
is from 1959Q4 to 2016Q2. We retrieve 117 U.S. monthly macroeconomic time series
($N = 117$) from the FRED-MD database, and the series are appropriately detrended
according to the guidelines given in McCracken and Ng (2015). Note that the FRED-MD
database originally contains a total of 128 series, but we remove 11 series for the
following reasons: the CBOE S&P 100 Volatility Index (VXOCLSx), Consumer sentiment
index (UMCSENTx), Trade weighted U.S. dollar index of major currencies (TWEXMMTH),
New orders for nondefense capital goods (ANDENOx), New orders for consumer goods
(ACOGNO), and New private housing permits (PERMIT, PERMITNE, PERMITMW,
PERMITS, PERMITW) have no observations from 1959. Furthermore, our preliminary
inspection found that Reserves of depository institutions, nonborrowed (NONBORRES) contained
extreme changes in February 2008, which would contaminate our analysis. The sample
period of the detrended monthly series is from March 1959 (1959:3) to June 2016 (2016:6).

4.1.3 Forecasting Strategy 

We evaluate the out-of-sample forecasting performance by mean squared forecast errors
(MSFE) in the evaluation period from 2000Q1 to 2016Q2. The parameter estimates are
obtained from each estimation period; the initial period is 1959Q4-1999Q4 and the next one
extends the end point to 2000Q1 with the starting point 1959Q4 being fixed. For example,
the initial forecast error in 2000Q1 is calculated using the estimates from the initial
estimation period 1959Q4-1999Q4, and the second forecast error in 2000Q2 uses the estimates
from the second estimation period 1959Q4-2000Q1. We suppose that the forecast
regression consists of eight lags ($\ell = 8$), so that the total number of parameters for the forecasting
regression to be estimated is $N\ell + N - \ell = 117 \times 8 + 117 - 8 = 1045$, including a constant
term. The penalized MIDAS regression is expected to be robust to the choice of $\ell$, as long
as we choose $\ell$ to be moderately large, because the penalized regression conducts model
selection as well as parameter estimation. To investigate the forecasting performance of the
penalized MIDAS regression model over a variety of horizons, we examine cases where
$h = 0, 1/3, 2/3, 1, 4/3, 5/3, 2$ in the same manner as Clements and Galvao (2008) and
Marcellino and Schumacher (2010). The cases $h = 0$, $1/3$, and $2/3$ correspond to nowcasting in
the sense that we forecast contemporaneous or very short-forecast-horizon quarterly GDP
growth using monthly series before the official announcement of the GDP, while the case
$h = 2$ is a forecast with a relatively long horizon. The sample size of the estimation period
$T$ gradually increases and varies depending on $h$; for example, $T$ ranges from 161 to 227 if
$h = 0$, and from 159 to 225 if $h = 2$.
Finally, we need to determine the values of the tuning parameters, a and λ, in advance
of the penalized MIDAS regression. Following the guidelines of Breheny and Huang (2011,
pp. 19 and 21) and our preliminary inspection of the overall samples, we set a = 12 for the
SCAD and MCP, although the performance could be improved by a more careful choice. The
value of λ is selected by 10-fold cross-validation, the validity of which was confirmed by
Uematsu and Tanaka (2015). All estimations for the penalized regression are conducted using
R 3.2.1 with the ncvreg package of Breheny and Huang (2011).
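To make the roles of a and λ concrete, the sketch below writes out the standard textbook forms of the three penalties for a scalar argument. It is not the ncvreg implementation; the default values of a are common conventions used here only for illustration, not the a = 12 chosen above.

```python
def scad_penalty(t, lam, a=3.7):
    """SCAD penalty of Fan and Li (2001) at a scalar t (requires a > 2)."""
    t = abs(t)
    if t <= lam:
        return lam * t
    if t <= a * lam:
        return (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))
    return lam**2 * (a + 1) / 2


def mcp_penalty(t, lam, a=3.0):
    """Minimax concave penalty of Zhang (2010) at a scalar t (requires a > 1)."""
    t = abs(t)
    if t <= a * lam:
        return lam * t - t**2 / (2 * a)
    return a * lam**2 / 2


def lasso_penalty(t, lam):
    """l1 (Lasso) penalty of Tibshirani (1996) at a scalar t."""
    return lam * abs(t)
```

All three penalties coincide near zero, where they behave like λ|t|; the SCAD and MCP flatten out beyond aλ, so large coefficients are not shrunk, whereas the Lasso penalty keeps growing linearly.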

4.1.4 Forecast performance 

To measure the performance appropriately, we consider two types of datasets. The first is a
complete dataset, that is, a dataset with no missing values. The second is a real-time
dataset, which has a jagged (ragged)-edge pattern due to the publication lags of the series.

4.1.5 Forecast performance in complete data 

We use data from 1959Q4 to 2016Q1 for the GDP and from 1959:3 to 2016:3 for the monthly
series to retrieve a complete dataset. We consider the following three evaluation periods:
(i) overall (2000Q1-2016Q1), (ii) 1st subsample (2000Q1-2007Q4), and (iii) 2nd subsample
(2008Q1-2016Q1). We split the sample because the unprecedented turmoil of the U.S. economy
stemming from the subprime mortgage crisis and the ensuing collapse of Lehman Brothers in
2008 would introduce parameter instability that would distort the forecast evaluation. As a
result, we evaluate the forecast performance of the penalized regression in complete data
using a total of 65 (overall), 32 (1st subsample), and 33 (2nd subsample) squared forecast
errors, respectively.

Tables 1-3 report the mean squared forecast errors (MSFE) of the penalized MIDAS
regression with the SCAD, MCP, and Lasso, and of their two competitors in the overall
sample, 1st subsample, and 2nd subsample, respectively. In the tables, the median squared
forecast errors are also shown in parentheses to remove contamination by outliers. All
values are relative to those of a naive AR(4) forecast. The two competitors are the factor
MIDAS (denoted “Factor” in the tables) proposed by Marcellino and Schumacher (2010) and the
two-step penalized regression (post-OLS) procedure (denoted “post-MCP,” “post-SCAD,” and
“post-Lasso” in the tables) proposed by Belloni and Chernozhukov (2013). The factor MIDAS is
expected to be a strong competitor since factor-based forecasts are found to perform well in
forecasting real variables (e.g., Stock and Watson, 2002, 2012; De Mol et al., 2008). The
factor MIDAS considered here is based on the basic MIDAS structure with the exponential
Almon lag structure of two hyperparameters. The number of factors is set to seven (r = 7)
based on the information criterion $IC_{p2}$ of Bai and Ng (2002). Although we could
consider the unrestricted factor MIDAS as in Marcellino and Schumacher (2010), which is free
from the distributed lag structure, we do not employ it because of its intractability caused
by high dimensionality. The two-step procedure using the Lasso is known as the OLS
post-Lasso. Belloni and Chernozhukov (2013) showed that it can perform at least as well as
the Lasso and can be better in some cases. We also consider the two-step procedure using the
MCP and SCAD penalties.
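The relative measures reported in the tables can be illustrated with a short sketch. It assumes that each relative value is the ratio of the summarized squared forecast errors of a method to those of the benchmark; the function name and the error values are hypothetical and are not taken from the paper.

```python
from statistics import mean, median


def relative_squared_error(model_errors, benchmark_errors, summary=mean):
    """Ratio of summarized squared forecast errors; values below 1 favor the model."""
    model_sq = [e**2 for e in model_errors]
    bench_sq = [e**2 for e in benchmark_errors]
    return summary(model_sq) / summary(bench_sq)


# Hypothetical forecast errors; the last model error is an outlier.
model = [0.1, -0.2, 0.1, 2.0]
benchmark = [0.2, -0.4, 0.2, 1.0]

rel_mean = relative_squared_error(model, benchmark, summary=mean)
rel_median = relative_squared_error(model, benchmark, summary=median)
```

In this toy example a single outlying error pushes the relative mean above one while the relative median stays well below one, which is why both measures are worth reporting side by side.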

First, we consider the nowcasting (0 ≤ h < 1) cases. Table 1 shows that all methods
are much better than the naive AR(4) forecast, and that the penalized MIDAS regression
outperforms the factor MIDAS and the two-step procedures in the overall sample with a few
exceptions, in terms of both the mean and median squared forecast errors. The two-step
procedures work well in terms of the MSFE, but do not seem to work well in the median
measure since they are frequently beaten by the naive AR(4) forecast. We also find that the
MSFE of the factor MIDAS when h = 1/3 is much worse than that of the other methods, owing to
outlying forecast values around the subprime mortgage crisis. Tables 2 and 3 show the
forecasting performance for the first and second subsamples, respectively. In the first
subsample, the penalized MIDAS regression does not necessarily work well; it performs well
when h = 0, but worse than the factor MIDAS when h = 1/3 and 2/3. However, we also find that
the penalized MIDAS regression performs well and completely dominates the factor MIDAS and
the two-step procedures in the second subsample in terms of both the mean and median
measures. Thus, it can be said that the penalized MIDAS regression is more robust to
structural instability than the other methods. Furthermore, we find that the MSFEs of the
two-step procedure are worse than those of the penalized MIDAS regression overall. Thus, the
two-step procedure does not provide effective efficiency gains in our situation. A probable
reason is that the total number of regressors in the second-step OLS regression does not
become effectively small when we assume a long lag structure in the model, even if variable
“screening” is conducted in the first step. This would make the efficiency losses arising
from estimating many parameters more serious than those from estimating the penalized MIDAS
regression directly.

Next, we turn to the forecast performance when h ≥ 1. The tables show that all the methods
have similar forecast performances; they perform well when h = 1, but when h > 1 they are
all beaten by the AR(4) forecast. These results are not surprising because Clements and
Galvao (2008) and Marcellino and Schumacher (2010) report the same findings. Hence, our
results show that the penalized MIDAS has a good forecast performance at very short
horizons, especially in the presence of instability, although it is not necessarily a
primary tool for forecasts with a relatively long horizon. Nevertheless, we can conclude
that the penalized MIDAS regression is an effective tool for forecasting with
mixed-frequency data because our main interest in forecasting with mixed-frequency data is
nowcasting, where low-frequency data are not yet available.

Table 1: Mean/Median Forecast Errors of the forecasts in complete data [Overall Sample]

                 h = 0    h = 1/3  h = 2/3  h = 1    h = 4/3  h = 5/3  h = 2
MCP              0.58     0.50     0.53     0.80     1.17     1.35     1.34
  (median)       (0.80)   (0.79)   (0.80)   (0.57)   (1.32)   (1.16)   (1.18)
SCAD             0.59     0.53     0.61     0.79     1.15     1.34     1.34
  (median)       (0.76)   (0.86)   (0.75)   (0.60)   (1.28)   (1.20)   (1.19)
Lasso            0.56     0.56     0.60     0.79     1.15     1.33     1.34
  (median)       (0.80)   (0.89)   (0.70)   (0.61)   (1.27)   (1.17)   (1.30)
Factor           0.83     2.12     0.89     0.75     1.89     1.25     1.25
  (median)       (1.11)   (0.89)   (0.81)   (1.00)   (1.12)   (1.62)   (1.73)
post-MCP         0.79     0.62     0.61     0.82     1.16     1.43     1.50
  (median)       (1.48)   (1.19)   (1.01)   (0.79)   (1.09)   (1.63)   (2.06)
post-SCAD        0.79     0.60     0.56     0.81     1.21     1.35     1.46
  (median)       (1.20)   (1.27)   (0.81)   (0.88)   (1.17)   (1.41)   (1.59)
post-Lasso       0.82     0.61     0.93     0.81     1.17     1.36     1.46
  (median)       (1.35)   (1.14)   (0.92)   (0.88)   (1.02)   (1.48)   (1.55)

Note: All values are relative to the AR(4) forecast. Values in parentheses are median squared forecast errors.

4.1.6 Forecast performance in real-time data

Section 4.1.5 reveals that the penalized regression behaves well in nowcasting with
complete data. However, when we actually conduct real-time forecasting of quarterly GDP with
monthly data, a complete dataset is not available because of the publication lags of the
series. Thus, we must face an incomplete dataset, the so-called “jagged (ragged)-edge”
dataset, that contains missing values in the latest months of some series. We therefore
investigate how well the forecast with the penalized regression works with real-time data.
It should be mentioned that, strictly speaking, our experiment considers “pseudo” real-time
forecasting; we suppose that the monthly data for all evaluation periods have the same
jagged (ragged)-edge pattern as that of the 2016-08 version of the FRED-MD. For example,
Real manufacturing and trade industry sales (CMRMTSPLx) and the Help-wanted index (HWI) have
missing values for the latest one and four months, respectively, owing to publication lags
in the 2016-08 version. We then suppose that the data for all estimation periods have the
same jagged-edge pattern even if our dataset contains complete data for those periods.
Moreover, we assume that no data revisions occur in our dataset.
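The pseudo real-time design described above amounts to masking the most recent observations of each series according to a fixed publication-lag pattern. A minimal sketch follows; the function name, the data values, and the series "OTHER" are hypothetical, and only the one- and four-month lags of CMRMTSPLx and HWI come from the text.

```python
def impose_ragged_edge(panel, publication_lags):
    """Return a copy of `panel` with the last `lag` observations of each series set to None."""
    masked = {}
    for name, values in panel.items():
        lag = min(publication_lags.get(name, 0), len(values))
        keep = len(values) - lag
        masked[name] = list(values[:keep]) + [None] * lag
    return masked


# Hypothetical monthly values; series without an entry in the lag map stay complete.
panel = {
    "CMRMTSPLx": [1.0, 1.1, 1.2, 1.3, 1.4, 1.5],
    "HWI": [2.0, 2.1, 2.2, 2.3, 2.4, 2.5],
    "OTHER": [3.0, 3.1, 3.2, 3.3, 3.4, 3.5],
}
ragged = impose_ragged_edge(panel, {"CMRMTSPLx": 1, "HWI": 4})
```

Applying the same lag map at every estimation end point reproduces the assumption that the jagged-edge pattern is identical across periods.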

Tables 4-6 show the relative MSFEs of the penalized regression and the state-space
ML estimator proposed by Bańbura and Modugno (2014) in the real-time overall sample
(2000Q1-2016Q2), 1st subsample (2000Q1-2007Q4), and 2nd subsample (2008Q1-2016Q2),
respectively. The tables omit the results for h > 1 and concentrate on the nowcast situation
(0 ≤ h ≤ 1) because real-time forecasting is meaningful only at very short horizons. The
state-space ML estimation enables us to handle real-time mixed-frequency data by embedding
the missing patterns of the data in the model; see Bańbura and Modugno (2014) for details.
On the other hand, the penalized regression requires an interpolated dataset to obtain the
forecast values. Thus, we employ the interpolation method based on the EM algorithm proposed
by Stock and Watson (2002).

Table 2: Mean/Median Forecast Errors of the forecasts in complete data [1st Subsample]

                 h = 0    h = 1/3  h = 2/3  h = 1    h = 4/3  h = 5/3  h = 2
MCP              0.76     0.74     0.70     0.93     1.03     1.29     0.76
  (median)       (0.84)   (0.94)   (1.15)   (0.90)   (1.37)   (1.77)   (0.84)
SCAD             0.77     0.78     0.71     0.92     1.03     1.27     0.77
  (median)       (0.81)   (0.97)   (1.07)   (0.83)   (1.28)   (1.53)   (0.81)
Lasso            0.74     0.77     0.76     0.92     1.03     1.27     0.74
  (median)       (0.59)   (0.88)   (1.15)   (0.84)   (1.28)   (1.51)   (0.59)
Factor           0.86     0.69     0.60     0.86     1.06     1.76     0.86
  (median)       (0.98)   (0.86)   (0.86)   (1.27)   (1.46)   (3.34)   (0.98)
post-MCP         0.95     1.09     0.92     1.09     1.23     1.60     0.95
  (median)       (1.52)   (1.31)   (1.52)   (1.28)   (1.51)   (2.41)   (1.52)
post-SCAD        1.21     1.00     0.76     1.04     1.07     1.55     1.21
  (median)       (0.75)   (1.22)   (1.06)   (1.59)   (1.69)   (1.52)   (0.75)
post-Lasso       1.33     1.03     1.51     1.07     1.06     1.60     1.33
  (median)       (1.19)   (1.40)   (1.54)   (1.60)   (1.35)   (1.52)   (1.19)

Note: All values are relative to the AR(4) forecast. Values in parentheses are median squared forecast errors.

From the tables, we first find that the effects of the jagged edge and the interpolation
on the forecast accuracy of the penalized regression are negligible since they do not
essentially change the mean/median squared forecast errors compared with the results in
Tables 1-3. Second, we see that the penalized regression performs well in the overall sample
and the 2nd subsample; it beats the state-space ML when h = 2/3 and 1 in both the mean and
median measures, and performs as well as the state-space ML when h = 1/3, while it works
relatively poorly in the 1st subsample, as in the complete data case. The state-space ML is
expected to have higher forecasting performance than the penalized regression because the
state-space ML is based on a system of equations with richer information, while the
penalized regression relies on a single equation. However, this would not be true when model
misspecification is present, as Bai et al. (2013) claimed. Our results, which reveal that
the penalized regression can compete with the state-space ML in terms of forecasting
accuracy, therefore imply that the system of equations contains a certain level of
misspecification. Moreover, it should be mentioned that the penalized regression is much
simpler and faster than the state-space ML in obtaining the forecast values. Since the
dimension of the state-space model can be very large when we forecast with mixed-frequency
data (117-dimensional state-space models with 40 latent factors in our case), the estimation
is much more computationally demanding and time consuming (roughly eight times longer than
the penalized regression). Furthermore, the estimates can be unstable if we apply the
state-space ML to a dataset with larger N and/or r.

Table 3: Mean/Median Forecast Errors of the forecasts in complete data [2nd Subsample]

                 h = 0    h = 1/3  h = 2/3  h = 1    h = 4/3  h = 5/3  h = 2
MCP              0.49     0.38     0.45     0.73     1.24     1.38     1.37
  (median)       (0.86)   (0.79)   (0.79)   (0.54)   (1.32)   (0.96)   (1.16)
SCAD             0.50     0.41     0.57     0.73     1.21     1.37     1.38
  (median)       (0.76)   (0.89)   (0.66)   (0.56)   (1.35)   (1.20)   (1.30)
Lasso            0.48     0.47     0.52     0.73     1.21     1.37     1.38
  (median)       (1.09)   (1.06)   (0.68)   (0.59)   (1.35)   (1.15)   (1.30)
Factor           0.82     2.81     1.03     0.69     2.30     0.99     1.19
  (median)       (1.54)   (1.39)   (0.94)   (1.07)   (1.09)   (1.53)   (1.43)
post-MCP         0.71     0.40     0.46     0.69     1.13     1.35     1.38
  (median)       (1.56)   (1.28)   (0.84)   (0.72)   (0.89)   (1.56)   (2.24)
post-SCAD        0.59     0.40     0.47     0.70     1.28     1.25     1.36
  (median)       (1.78)   (1.48)   (0.82)   (0.72)   (1.14)   (1.60)   (1.54)
post-Lasso       0.57     0.41     0.65     0.68     1.23     1.24     1.32
  (median)       (1.65)   (1.09)   (0.77)   (0.72)   (1.14)   (1.60)   (1.46)

Note: All values are relative to the AR(4) forecast. Values in parentheses are median squared forecast errors.

Table 4: Mean/Median Forecast Errors of the forecasts in jagged-edge data [Overall Sample]

                 h = 0    h = 1/3  h = 2/3  h = 1
MCP              0.58     0.50     0.54     0.80
  (median)       (0.82)   (0.81)   (0.88)   (0.62)
SCAD             0.60     0.53     0.62     0.80
  (median)       (0.79)   (0.88)   (0.81)   (0.67)
Lasso            0.57     0.57     0.61     0.80
  (median)       (0.92)   (0.93)   (0.78)   (0.67)
State-Space ML   0.47     0.49     0.66     0.84
  (median)       (0.70)   (0.71)   (1.00)   (1.02)

Note: All values are relative to the AR(4) forecast. Values in parentheses are median squared forecast errors.

Although we do not examine them here, the Ridge regression and the Bayesian VAR
(BVAR) would be potential alternatives to the state-space ML (e.g., De Mol et al., 2008;
Schorfheide and Song, 2015). However, they are also computationally demanding (the BVAR
requires the estimation of more than 100,000 parameters in our case), and their theoretical
properties have not yet been investigated under “ultra”high dimensionality (i.e., p diverges
at a sub-exponential rate).

Table 5: Mean/Median Forecast Errors of the forecasts in jagged-edge data [1st Subsample]

                 h = 0    h = 1/3  h = 2/3  h = 1
MCP              0.76     0.74     0.70     0.93
  (median)       (0.84)   (0.94)   (1.15)   (0.90)
SCAD             0.77     0.78     0.71     0.92
  (median)       (0.81)   (0.97)   (1.07)   (0.83)
Lasso            0.74     0.77     0.76     0.92
  (median)       (0.59)   (0.88)   (1.15)   (0.84)
State-Space ML   0.62     0.67     0.71     0.94
  (median)       (0.72)   (0.72)   (0.65)   (1.17)

Note: All values are relative to the AR(4) forecast. Values in parentheses are median squared forecast errors.

4.2 Screening Effective Portfolio from a Large Number of Potential Securities

Recent studies on portfolio selection have focused on the penalized regression because it
plays a crucial role in constructing a portfolio when there are a large number of potential
stocks. Brodie et al. (2009) found that the penalized regression is useful in selecting an
optimal portfolio in terms of the out-of-sample performance measured by the Sharpe ratio;
Fan et al. (2012) introduced gross-exposure constraints to admit short sales in the
estimation of an optimal portfolio; Carrasco and Noumon (2012) focused on estimating the
precision matrix of returns. They found that the penalized regression is quite useful for
stabilizing the estimation of the covariance matrix and provides better finite sample
performance than traditional methods.

To the best of our knowledge, the existing literature on applications of the penalized
regression to portfolio selection has focused on yieldability. However, it also seems
interesting to examine the consistent estimation of the portfolio weights; that is, it is
valuable to screen how fund managers construct their portfolios from a large number of
securities. Unlike other high-dimensional estimation methods, such as the factor model and
the Ridge regression, the SCAD-type penalized regression enables us to screen their
portfolios from a large dataset of stock prices. In this section, we examine how well the
penalized regression works in this direction using a large NYSE stock price dataset.

Table 6: Mean/Median Forecast Errors of the forecasts in jagged-edge data [2nd Subsample]

                 h = 0    h = 1/3  h = 2/3  h = 1
MCP              0.50     0.39     0.46     0.74
  (median)       (1.06)   (0.82)   (0.81)   (0.55)
SCAD             0.51     0.42     0.58     0.74
  (median)       (0.83)   (0.90)   (0.71)   (0.65)
Lasso            0.49     0.48     0.53     0.74
  (median)       (1.13)   (1.10)   (0.69)   (0.65)
State-Space ML   0.41     0.40     0.63     0.79
  (median)       (0.84)   (0.76)   (1.05)   (1.12)

Note: All values are relative to the AR(4) forecast. Values in parentheses are median squared forecast errors.

4.2.1 Construction of Portfolio

Suppose a fund manager faces p potential stocks, where $x_{it}$ is the rate of return of the
ith (i = 1, 2, ..., p) stock at time t. Let $x_t = [x_{1t}, x_{2t}, \ldots, x_{pt}]'$ be the
p-dimensional vector of rates of return at t and $\omega_0$ be the p-dimensional weight
vector of the portfolio that satisfies $\|\omega_0\|_0 = s$ ($\ll p$), $\iota'\omega_0 = 1$,
and $\|\omega_0\|_1 = \psi_w$, where $\psi_w \in [1, \infty)$ and $\iota$ is a p-dimensional
vector with all elements being one. That is, the portfolio is constructed from s of the p
potential stocks. We assume the fund manager constructs her portfolio as

\[ y_t = x_t'\omega_0 + u_t, \quad t = 1, \ldots, T, \tag{5} \]

where $u_t$ is a “miscellaneous” component that includes all assets in the portfolio other
than stocks, such as T-bills and corporate bonds. Further, we assume that $x_t$ and $u_t$
are independent of each other and $u_t \sim \text{i.i.d. } N(0, \sigma_u^2)$, where
$\sigma_u^2$ is set to attain a given signal-to-noise ratio
$\mathrm{SNR} = V(x_t'\omega_0)/V(u_t)$, $\omega_{0A}$ is the nonzero s-dimensional
subvector of $\omega_0$, and $X_A$ is the T × s submatrix of X that corresponds to
$\omega_{0A}$. Although we might consider the case in which $x_t$ and $u_t$ are dependent by
extending the results of Fan and Liao (2014), this is beyond the scope of our research, and
we regard $x_t$ and $u_t$ as independent here.

The portfolio allows short sales if $\psi_w = \|\omega_0\|_1 > 1$, which determines a
constraint on the short sales as shown in Fan et al. (2012). Let
$w_0^+ = (\psi_w + 1)/2$ and $w_0^- = (\psi_w - 1)/2$. Then $w_0^+$ and $w_0^-$ correspond
to the total proportions of long and short sales, respectively, since
$w_0^+ + w_0^- = \|\omega_0\|_1$ and $w_0^+ - w_0^- = 1$, and $w_0^-$ becomes larger as
$\psi_w$ grows, while short sales are not allowed if $\psi_w = 1$ ($w_0^- = 0$). We assume
the fund manager holds s/2 stocks each in long and short positions and that she employs
equal weights within the long and short positions; that is, we assume
$\omega_{0i} = w_0^+/(s/2)$ for $i \in \omega_{0A+}$, $-w_0^-/(s/2)$ for
$i \in \omega_{0A-}$, and 0 for $i \in \omega_{0B}$, where $\omega_{0i}$ is the ith element
of $\omega_0$, and $\omega_{0A+}$, $\omega_{0A-}$, and $\omega_{0B}$ are the sets of stocks
with long, short, and no positions, respectively.
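The weight construction above can be sketched in a few lines. The function name, the seed, and the value of the l1-norm used in the call are assumptions made for illustration; only the equal-weight long/short rule and the constraints that the weights sum to one and have a prescribed l1-norm come from the text.

```python
import random


def build_weights(p, s, psi_w, seed=0):
    """Equal-weight long/short portfolio with l1-norm psi_w and weights summing to one."""
    if s % 2 != 0 or s > p or psi_w < 1:
        raise ValueError("need an even s <= p and psi_w >= 1")
    w_plus = (psi_w + 1) / 2   # total proportion held long
    w_minus = (psi_w - 1) / 2  # total proportion sold short
    rng = random.Random(seed)
    active = rng.sample(range(p), s)  # s stocks drawn with equal probabilities
    long_set, short_set = active[: s // 2], active[s // 2:]
    weights = [0.0] * p
    for i in long_set:
        weights[i] = w_plus / (s / 2)
    for i in short_set:
        weights[i] = -w_minus / (s / 2)
    return weights


# psi_w = 2.0 is a hypothetical value chosen for the example.
omega0 = build_weights(p=1853, s=34, psi_w=2.0)
```

With these inputs the long positions total 1.5 and the short positions total 0.5, so the weights sum to one and their absolute values sum to two, as the two constraints require.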

4.2.2 Data and Evaluation Strategy 

We retrieve weekly stock price data of the NYSE from Yahoo! Finance. Our dataset contains
1853 adjusted stock prices (p = 1853), starting from the 1st week of January 2009 and ending
in the 4th week of April 2016. In this application, we apply the log-difference to the stock
price data and standardize them so that the data are converted to rates of return with zero
means and unit variances. We investigate the cases of s = 34 and 40 with a = 14, SNR = 10,
and $\psi_w$ = 10. The s nonzero stocks are drawn randomly from the p candidates with equal
probabilities. Furthermore, we assume the fund manager does not rebalance the portfolio;
hence, it remains unchanged over the whole sample period. Brodie et al. (2009) discuss the
possibility of estimating a weight vector for a portfolio in the presence of rebalancing
with a penalized regression, but we do not consider that case here.

The purpose of this application is to screen the kinds of stocks in which the fund manager
invests from a large number of potential stocks using the penalized regression. We examine
how well the penalized estimator $\hat{\omega}$ can distinguish the nonzero from the zero
elements of $\omega_0$ in finite samples. To evaluate the finite sample properties of
$\hat{\omega}$, we focus on SC-A = $P(\mathrm{sgn}(\hat{\omega}_A) = \mathrm{sgn}(\omega_{0A}))$
and SC-B = $P(\mathrm{sgn}(\hat{\omega}_B) = \mathrm{sgn}(\omega_{0B}))$; the SC-A is the
success rate of detecting the nonzero elements of $\omega_0$ with the correct sign, and the
SC-B is that of detecting the zero elements. We expect that the SCAD-type penalized
regression estimator attains high SC-A and SC-B values as T becomes large thanks to the
oracle property. The SC-A and SC-B are sequentially computed for 172 evaluation periods in
this application, where the end point advances by one week at a time while the start point
is fixed; the initial evaluation period starts from the 2nd week of January 2009 and ends in
the 1st week of December 2010 (T = 209). The 2nd evaluation period runs from the 2nd week of
January 2009 to the 2nd week of December 2010 (T = 210), and so on. The terminal evaluation
period is from the 2nd week of January 2009 to the 4th week of April 2016 (T = 381).
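One way to compute the two success rates for a single estimate is sketched below. It assumes that SC-A is measured as the share of truly nonzero weights estimated with the correct sign and SC-B as the share of truly zero weights estimated as exactly zero; the paper may aggregate differently, and the function names and all numbers are hypothetical.

```python
def sign(x, tol=0.0):
    """Return -1, 0, or 1; values with |x| <= tol count as zero."""
    if x > tol:
        return 1
    if x < -tol:
        return -1
    return 0


def success_rates(omega_hat, omega0):
    """Share of sign matches on the true nonzero set A and on the true zero set B."""
    hits_a = [sign(h) == sign(t) for h, t in zip(omega_hat, omega0) if t != 0]
    hits_b = [sign(h) == 0 for h, t in zip(omega_hat, omega0) if t == 0]
    sc_a = sum(hits_a) / len(hits_a)
    sc_b = sum(hits_b) / len(hits_b)
    return sc_a, sc_b


# Hypothetical true weights and estimates.
omega0 = [0.5, -0.5, 1.0, 0.0, 0.0, 0.0]
omega_hat = [0.4, 0.1, 0.9, 0.0, 0.2, 0.0]
sc_a, sc_b = success_rates(omega_hat, omega0)
```

Here the second weight is estimated with the wrong sign and the fifth is a false positive, so both rates equal 2/3. The sketch assumes that both sets A and B are nonempty.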

4.2.3 Empirical Results 

Figures 2-3 show the SC-A and Figures 4-5 show the SC-B of the MCP, SCAD, and Lasso for
the 172 evaluation periods, with s = 34 and 40 in each pair, respectively. To begin with, we
consider the SC-A. At a glance, both Figures 2 and 3 reveal two characteristics of
$\hat{\omega}$. First, the SC-A increases toward 1 as T grows for all penalties. Although
the SC-A for s = 40 seems uniformly lower than that for s = 34 for all T, this is because
more nonzero elements require a greater search cost. Second, the SC-A of the Lasso tends to
be higher than that of the MCP and SCAD when T is relatively small, while the ordering seems
to be reversed when T grows large. This is consistent with the theory because the Lasso
tends to have many “false positive” estimates. That is, it overestimates the total number of
nonzero elements since it rarely satisfies the assumptions for model selection consistency,
while the MCP and SCAD satisfy these assumptions in many cases, as argued in Appendix A.2.
Hence, the SC-A of the Lasso is not expected to be higher than that of the MCP and SCAD when
T is large.

Next, we focus on the SC-B. Figures 4 and 5 show that the SC-B of the MCP and SCAD is
nearly equal to 1 and dominates that of the Lasso for all T. The results are consistent with
the theory because the MCP and SCAD have the oracle property, which means they can detect
true zero parameters more precisely than the Lasso can, except for extraordinary cases.

In summary, our empirical results reveal that the model selection consistency of the
SCAD-type penalty works well in a large stock price dataset. This implies that the penalized
regression enables us to effectively detect the behavior of fund managers from large
financial datasets.

Figure 2: SC-A when s = 34 (from T = 209 to T = 381)

Figure 3: SC-A when s = 40 (from T = 209 to T = 381)

Figure 4: SC-B when s = 34 (from T = 209 to T = 381)

Figure 5: SC-B when s = 40 (from T = 209 to T = 381)

5 Conclusion

We have studied macroeconomic forecasting and variable selection using a folded-concave
penalized regression with a very large number of predictors. The contributions include both
theoretical and empirical results. The first half of the paper developed the theory for a
folded-concave penalized regression in ultrahigh dimensions when the model exhibits time
series dependence. Specifically, we have proved the oracle inequality and the oracle
property under appropriate conditions for macroeconomic time series. The latter half of the
paper provided two empirical applications that motivated us to use the penalized regression
for a large macroeconomic dataset. The first was the forecasting of quarterly U.S. real GDP
with a large amount of monthly macroeconomic data taken from the FRED-MD through the MIDAS
regression framework; the forecasting model consisted of more than 1000 monthly predictors
including lags, while the sample size was much smaller than the total number of predictors.
The forecasting performance of the penalized regression is promising compared to that of the
factor MIDAS proposed by Marcellino and Schumacher (2010) and the state-space (nowcasting)
model of Bańbura and Modugno (2014). The second application screened a portfolio that
contained about 40 stocks from more than 1800 stocks using NYSE stock price data. The oracle
property ensures variable selection consistency; that is, the penalized regression with the
SCAD-type penalty can theoretically detect the portfolio from the data. In fact, we observed
that the variable selection consistency worked properly when screening the portfolio. Our
theoretical and empirical contributions are expected to introduce econometricians to the
world of ultrahigh dimensional macroeconomic data.

Appendix

A.1 Assumptions for the oracle property

Assumption 7 There is a sequence $\lambda = o(1)$ such that $\|G_{0BT}\|_\infty < \lambda/2$
holds with high probability.

Assumption 8 (s/T)^^^ A d = o(l) and = 0 for a sufficiently large T. 
Assumption 9 For all i, max, E < oo. 

Assumption 10 There exists a constant $c_H$ such that the Hessian submatrix satisfies, with
high probability, $\min_{v \in \mathbb{R}^s} v'H_{AAT}v \geq c_H\|v\|_2^2$.

Assumption 11 c/ < An,in(/oAA) < AmaxGoAA) < 1/c/ for a (small) constant c/ > 0. 
Assumption 12 $\|H_{BAT}\|_{2,\infty} = \max_{\|v\|_2 = 1}\|H_{BAT}v\|_\infty = O_p(1)$.

Assumption 13 for some constant > 0. 

A.2 Model selection inconsistency of Lasso 

As far as forecasting is concerned, Theorem 1 shows that the resulting performance does
not depend on the choice of penalties. However, if we wish to know which variables should
be selected, the situation changes. We argue that a key assumption for the model selection
consistency of the $\ell_1$-penalty (Lasso) does not hold, while that of a SCAD-type penalty
does.

Zhao and Yu (2006) studied a concept called sign consistency, defined by
$P(\mathrm{sgn}(\hat{\beta}) = \mathrm{sgn}(\beta_0)) \to 1$, which is stronger than model
selection consistency. Under a deterministic covariate assumption, they show that the weak
irrepresentable condition

\[ \|H_{BAT}H_{AAT}^{-1}\,\mathrm{sgn}(\beta_{0A})\|_\infty < 1 \]

is necessary for the sign consistency of the Lasso. To establish the model selection
consistency of the Lasso, we usually need the stronger condition

\[ \|H_{BAT}H_{AAT}^{-1}\|_\infty \leq C \quad \text{for some } C \in (0, 1), \]

which was assumed by Fan and Lv (2011). It seems difficult to prove model selection
consistency for the Lasso without this condition; however, the condition may be easily
violated. Let $x_i$, $i \in B$, be a column vector of $X_B$. Then, the left-hand side of the
bound is

\[ \|H_{BAT}H_{AAT}^{-1}\|_\infty = \max_{i \in B}\|(X_A'X_A)^{-1}X_A'x_i\|_1 =: \max_{i \in B}\|\hat{\pi}_i\|_1, \]

where $\hat{\pi}_i$ is regarded as the OLS estimator in the regression of an irrelevant
variable $x_i$ on the important variables $X_A$. Due to stationarity, this is $O_p(q)$
provided that the regularity conditions for an asymptotic theory are satisfied. Even when q
is finite, it is unrealistic for this value to be strictly bounded by one since
macroeconomic data have cross-sectional dependence in general. When lagged variables are
included in X, the condition becomes tighter because A and B may share the same variable.
Violation of the condition would lead to a collapse of the economic interpretation of the
coefficients estimated with the Lasso.
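The quantity that the stronger condition bounds can be computed directly as the largest l1-norm of the OLS coefficients obtained by regressing each irrelevant variable on the relevant ones. The sketch below does this with NumPy on simulated data; the function name and both designs are hypothetical and serve only to show how dependence between the two groups of variables pushes the value above one.

```python
import numpy as np


def irrepresentable_norm(X, active):
    """Return the maximum over inactive columns i of ||(X_A'X_A)^{-1} X_A' x_i||_1."""
    active = np.asarray(active)
    inactive = np.setdiff1d(np.arange(X.shape[1]), active)
    X_A, X_B = X[:, active], X[:, inactive]
    # Column i of `coefs` holds the OLS coefficients from regressing x_i on X_A.
    coefs, *_ = np.linalg.lstsq(X_A, X_B, rcond=None)
    return np.abs(coefs).sum(axis=0).max()


rng = np.random.default_rng(0)
T = 500
X_ind = rng.standard_normal((T, 6))  # six mutually independent regressors
X_dep = X_ind.copy()
# Make the inactive column 3 nearly the sum of the active columns 0 and 1.
X_dep[:, 3] = X_dep[:, 0] + X_dep[:, 1] + 0.1 * rng.standard_normal(T)

norm_ind = irrepresentable_norm(X_ind, active=[0, 1, 2])
norm_dep = irrepresentable_norm(X_dep, active=[0, 1, 2])
```

In the independent design the value stays well below one, whereas in the dependent design it is close to two, so a bound of the form C < 1 cannot hold there.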

A.3 Lemmas for Theorem 1

The following lemmas were given by Loh and Wainwright (2015, Lemma 4(b) and Lemma
5), and are consequences of Assumption 2. They are used to fill the gap between the
$\ell_1$-norm and SCAD-type penalties. The proofs are omitted.

Lemma 1 Under Assumption 2, any vector $\beta \in \mathbb{R}^p$ satisfies

\[ \lambda\|\beta\|_1 \leq \|p_\lambda(\beta)\|_1 + (\mu/2)\|\beta\|_2^2. \]

Lemma 2 Under Assumption 2, for any vector $\beta \in \mathbb{R}^p$ such that
$\xi\|p_\lambda(\beta_0)\|_1 - \|p_\lambda(\beta)\|_1 > 0$ and $\xi \geq 1$, we have

\[ \xi\|p_\lambda(\beta_0)\|_1 - \|p_\lambda(\beta)\|_1 \leq \xi\lambda\|\beta_A - \beta_{0A}\|_1 - \lambda\|\beta_B - \beta_{0B}\|_1. \]

A.4 Lemmas for Theorem 2

In Lemma 3 below, let $A := \{j \in \{1, \ldots, p\} : \hat{\beta}_j \neq 0\}$, a set of
indices corresponding to all nonzero components of $\hat{\beta}$, and let $\hat{\beta}_A$
denote a subvector of $\hat{\beta}$ formed by its restriction to A. The other symbols are
defined analogously. Let $\circ$ denote the Hadamard product. The sign function
$\mathrm{sgn}(\cdot)$ is applied coordinate-wise. Define

\[ G_{AT}(\beta) = -T^{-1}X_A'y + T^{-1}X_A'X\beta, \]

\[ G_{BT}(\beta) = -T^{-1}X_B'y + T^{-1}X_B'X\beta. \]

Define the local concavity at $b \in \mathbb{R}^r$ with $\|b\|_0 = r$ as
$\kappa_\lambda(b) = \max_{1 \leq j \leq r} -p''_\lambda(|b_j|)$.

Lemma 3 Suppose Assumption 2 holds. Then $\hat{\beta}$ is a strict local minimizer of
$Q_T(\beta)$ in (2) if

\[ G_{AT}(\hat{\beta}) + p'_\lambda(|\hat{\beta}_A|) \circ \mathrm{sgn}(\hat{\beta}_A) = 0, \tag{6} \]

\[ \|G_{BT}(\hat{\beta})\|_\infty < p'_\lambda(0+), \tag{7} \]

\[ \lambda_{\min}(H_{AAT}) > \kappa_\lambda(\hat{\beta}_A). \tag{8} \]

Conversely, any local minimizer of $Q_T(\beta)$ must satisfy (6), (7), and (8) with strict
inequalities replaced by nonstrict ones.

The proof was given by Lv and Fan (2009, Theorem 1). Consider the case where
$\hat{\beta}_A \in N_0$. Under Assumption 8, it holds that
$\sup_{\beta_A \in N_0} \kappa_\lambda(\beta_A) = 0$ for sufficiently large T. Thus,
condition (8) is satisfied as long as $\lambda_{\min}(H_{AAT})$ is bounded away from zero.

A.5 Proofs of Theorems 1 and 2 

Proof of Theorem 1 Because yS minimizes Qrifi), we have 

(2r)-'||y - xMl + WpAmi < (2r)-i||y - AySoll2 + \\PA(fio)\\i- 
By model (1) and Holder’s inequality, this can be rewritten and bounded as 
(2r)-'||A(y8-y8o)ll^ < T-^u-^X(fi-fio) + \\pA(fio)\U - WPAmU 
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( 9 ) 


In what follows, we have only to work on event Si defined in Assumption 3. On the event, 
we have ||r“‘Z^H||oo < d/2, so that (9) becomes 

{ 2 Tr^\\X(fi-fi^)\\l < 2-^A\\fi-fi^\\i + ||a,08o)IIi " IIPiOS)lli- dO) 

By Lemma 1, the first term in the upper bound of (10) is further bounded by 

< \\Pa 0)\U + \\PA(fio)\\i+(PnW-fio\\i (11) 

where the last inequality follows from the subadditivity implied by the concavity of the 
penalty function. On the other hand, since ||jS - ySgllo ^ ll;S||o + HjSqIIo < m holds on the 
assumed parameter space due to HySgllo = Assumption 4 yields the lower bound of (10); 
that is, we have on £2 defined in Assumption 4 

T-^\\X(fi-fi,)\\l>y\\fi-Ml (12) 

Therefore, combining (10) with (11) and (12) gives 

(r-iu/2)\\fi-fi,\\l < MPA(fio)\U - WPAmU. (13) 

In particular, (13) implies 'i\\px{fio)\\\ - \\PA(fi)\\\ ^ 0, so that we can apply Lemma 2 to the 
right-hand side of (13) to obtain 

{y-pl2W-p,f^ < 3d|^^ -;8 oaIIi - -PmWi- (14) 

Ignoring the last term and the Cauchy-Schwarz inequality lead to 

{y-pl2W-p,\\l < -Ml < 3s^l^AW^-Mi < 2s^'^AW-M^ 

which concludes the error bound in the ^ 2 -norm 
6s^^^A 

W-M<-. -• ( 15 ) 

2y-fi 
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Using (15), we can obtain the error bound in the ^i-norm as well. Since (14) also implies 


that \\Pb -)8obIIi ^ ^Wa -PoaWi^ we have 


W-P,\\i = WA-PoA\\i + WB-P. 


oalli 


< 


Wa-PoaWi < ^s^'^Wa-PoaWi < As^l^\P-P,h < 


24^4 


( 16 ) 


2y-H 

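Finally, we derive the prediction error bound from (16). The mean value theorem,
Assumption 2, and the triangle inequality give

\begin{align*}
\|p_\lambda(\beta_0)\|_1 - \|p_\lambda(\hat\beta)\|_1 &= \sum_{j=1}^p \big(p_\lambda(|\beta_{0j}|) - p_\lambda(|\hat\beta_j|)\big) = \sum_{j=1}^p p'_\lambda(b_j)\big(|\beta_{0j}| - |\hat\beta_j|\big) \\
&\le \lambda\sum_{j=1}^p |\hat\beta_j - \beta_{0j}| = \lambda\|\hat\beta - \beta_0\|_1,
\end{align*}

where $b_j$ is a point between $|\beta_{0j}|$ and $|\hat\beta_j|$. Hence, using (10) and (16), we obtain

$$T^{-1}\|X(\hat\beta - \beta_0)\|_2^2 \le 3\lambda\|\hat\beta - \beta_0\|_1 \le \frac{72s\lambda^2}{2\gamma - \mu}. \tag{17}$$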

Results (15)-(17) hold with probability at least $1 - O(p^{-c_1}) - O(\exp(-c_2T))$ by Assumptions
3 and 4. □
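As a quick numerical reading of the chain (14)-(17), the sketch below evaluates the three right-hand sides for given constants. It assumes the relations used in the proof: the $\ell_1$ bound is $4s^{1/2}$ times the $\ell_2$ bound, and the prediction bound is $3\lambda$ times the $\ell_1$ bound, which is what (10) combined with the mean value step gives. The function name and the sample values of $s$, $\lambda$, $\gamma$ and $\mu$ are illustrative assumptions, not values from the paper.

```python
import math


def theorem1_bounds(s, lam, gamma, mu):
    """Evaluate the right-hand sides of (15)-(17) for given constants.

    Hypothetical helper: s is the sparsity, lam the tuning parameter,
    gamma the constant in (12) and mu the concavity constant of the penalty.
    """
    if 2 * gamma <= mu:
        raise ValueError("the bounds require 2*gamma > mu")
    l2 = 6 * math.sqrt(s) * lam / (2 * gamma - mu)  # (15)
    l1 = 4 * math.sqrt(s) * l2                      # (16): l1 error <= 4 s^{1/2} * l2 error
    pred = 3 * lam * l1                             # (17): prediction error <= 3 lam * l1 error
    return l2, l1, pred


# Illustrative values only.
l2, l1, pred = theorem1_bounds(s=10, lam=0.1, gamma=1.0, mu=0.5)
print(l2, l1, pred)
```

All three bounds scale with $1/(2\gamma - \mu)$, so they degrade as the penalty's concavity $\mu$ approaches $2\gamma$.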


Proof of Theorem 2 First, we show results (a) and (b) through the following steps.

Step 1. We consider $Q_T(\beta)$ in the correctly constrained space $\{\beta \in \mathbb{R}^p : \beta_B = 0 \in \mathbb{R}^{p-s}\}$,
which is the $s$-dimensional subspace $\{\beta_A \in \mathbb{R}^s\}$. The corresponding objective function is
given by

$$Q_T(\beta_A, 0) = (2T)^{-1}\|y - X_A\beta_A\|_2^2 + \|p_\lambda(\beta_A)\|_1. \tag{18}$$

We now show the existence of a strict local minimizer $\hat\beta_{0A}$ of $Q_T(\beta_A, 0)$ such that
$\|\hat\beta_{0A} - \beta_{0A}\|_2 = O_p((s/T)^{1/2})$. To this end, it is sufficient to prove that, for a large constant $C > 0$, the
event

$$\mathcal{E}_Q = \Big\{\inf_{\|v\|_2 = C} Q_T\big(\beta_{0A} + v(s/T)^{1/2}, 0\big) > Q_T(\beta_{0A}, 0)\Big\} \tag{19}$$

occurs with probability tending to one. This implies that, with high probability, there is a
local minimizer $\hat\beta_{0A}$ of $Q_T(\beta_A, 0)$ in the ball $N_C = \{\beta_A \in \mathbb{R}^s : \|\beta_A - \beta_{0A}\|_2 \le C(s/T)^{1/2}\}$.

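By the definition of the objective function, we have

\begin{align}
R_T(v) &:= Q_T\big(\beta_{0A} + v(s/T)^{1/2}, 0\big) - Q_T(\beta_{0A}, 0) \notag\\
&= (s/T)^{1/2}v^\top G_{0AT} + 2^{-1}(s/T)v^\top H_{AAT}v \tag{20}\\
&\quad + \|p_\lambda(\beta_{0A} + v(s/T)^{1/2})\|_1 - \|p_\lambda(\beta_{0A})\|_1. \tag{21}
\end{align}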


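First, we evaluate the two terms in (21). The mean value theorem gives

\begin{align}
\|p_\lambda(\beta_{0A} + v(s/T)^{1/2})\|_1 - \|p_\lambda(\beta_{0A})\|_1 &= \sum_{j \in A} p'_\lambda(|\beta^*_{0j}|)\big(|\beta_{0j} + v_j(s/T)^{1/2}| - |\beta_{0j}|\big) \notag\\
&\le p'_\lambda(d)(s/T)^{1/2}\|v\|_1, \tag{22}
\end{align}

where $|\beta^*_{0j}|$ lies between $|\beta_{0j}|$ and $|\beta_{0j} + v_j(s/T)^{1/2}|$, and the last inequality follows from the
monotonicity of $p'_\lambda(\cdot)$, $\min_{j \in A}|\beta^*_{0j}| \ge d$, and the triangle inequality. Eventually, the last term
is zero by Assumption 8. Next, we consider (20). Since martingale difference sequences are
serially uncorrelated, Assumption 9 entails that

\begin{align*}
E\|G_{0AT}\|_2^2 &= T^{-2}E[u^\top X_AX_A^\top u] = T^{-2}\sum_{j \in A}E\Big[\Big(\sum_{t=1}^T x_{tj}u_t\Big)^2\Big] \\
&= T^{-2}\sum_{j \in A}\sum_{t=1}^T E[x_{tj}^2u_t^2] = O(s/T).
\end{align*}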


This together with the Markov inequality implies that $\|G_{0AT}\|_2$ is $O_p((s/T)^{1/2})$. Therefore,
the Cauchy-Schwarz inequality yields

$$(s/T)^{1/2}|v^\top G_{0AT}| \le (s/T)^{1/2}\|v\|_2\|G_{0AT}\|_2 = O_p(s/T)\|v\|_2.$$

By contrast, Assumption 10 gives

$$(s/T)v^\top H_{AAT}v \ge (s/T)\lambda_{\min}(H_{AAT})\|v\|_2^2 \ge (s/T)c_H\|v\|_2^2. \tag{23}$$

Because (23) dominates the other terms of $R_T(v)$ when $\|v\|_2$ is taken large, $\inf_{\|v\|_2 = C}R_T(v)$
is eventually positive as $T$ grows large. Thus, with probability approaching one, (19) holds, and
$\|\hat\beta_{0A} - \beta_{0A}\|_2 \le C(s/T)^{1/2}$.

Step 2. To complete the proof of (a) and (b), it remains to show that $\hat\beta_0 := (\hat\beta_{0A}, 0)$
is indeed a strict local minimizer of $Q_T(\beta)$ in $\mathbb{R}^p$. From Lemma 3, it suffices to check
conditions (6), (7), and (8) with $\hat\beta = \hat\beta_0$, and condition (6) is satisfied by the proof of
Theorem 1 in Fan and Lv (2011).

We then check condition (8). Define $N_0 := \{\beta_A \in \mathbb{R}^s : \|\beta_A - \beta_{0A}\|_\infty \le d\}$, where
we recall $d = \min_{j \in A}|\beta_{0j}|/2$. By Assumption 8, we have $d/(s/T)^{1/2} \to \infty$, so that, for
sufficiently large $T$, $\beta_A \in N_C$ implies $\beta_A \in N_0$. Thus the condition is eventually satisfied by
Assumptions 10 and 11 along with the comment after Lemma 3.

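To verify (7), we first see that $(s/T)^{1/2}/\lambda = o(1)$ by Assumption 8. Thus, Assumptions
7 and 12 establish

\begin{align*}
\|G_{BT}(\hat\beta_0)\|_\infty &= \|H_{BAT}(\hat\beta_{0A} - \beta_{0A}) + G_{0BT}\|_\infty \le \|H_{BAT}(\hat\beta_{0A} - \beta_{0A})\|_\infty + \|G_{0BT}\|_\infty \\
&\le \|H_{BAT}\|_{2,\infty}\|\hat\beta_{0A} - \beta_{0A}\|_2 + \lambda/2 \\
&= O_p(1)C(s/T)^{1/2} + \lambda/2 = \{o_p(1) + 1\}\lambda/2.
\end{align*}

Since $p'_\lambda(0+) = \lambda$ in Assumption 2, condition (7) holds for a sufficiently large $T$. This
completes the proof of (a) and (b).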

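Finally, we prove (c). Clearly we only need to show the asymptotic normality of $\hat\beta_{0A}$.
Assumption 11 ensures that $I_{0AA}$ is positive definite, and hence, $I_{0AA}^{-1/2}$ is well-defined. On
the event $\mathcal{E}_Q$ in (19), it has been shown that $\hat\beta_{0A} \in N_C$ is a strict local minimizer of $Q_T(\beta_A, 0)$
and $\partial Q_T(\hat\beta_{0A}, 0)/\partial\beta_A = 0$. We thus obtain, for any vector $b \in \mathbb{R}^s$ such that $\|b\|_2 = 1$,

$$-T^{1/2}b^\top I_{0AA}^{-1/2}H_{AAT}(\hat\beta_{0A} - \beta_{0A}) = T^{1/2}b^\top I_{0AA}^{-1/2}G_{0AT} + T^{1/2}b^\top I_{0AA}^{-1/2}\,p'_\lambda(\hat\beta_{0A}) \circ \mathrm{sgn}(\hat\beta_{0A}). \tag{24}$$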


Recall that $T^{1/2}b^\top I_{0AA}^{-1/2}G_{0AT} = \sum_{t=1}^T \xi_{Tt}$ and $\xi_{Tt}$ is a martingale difference array. We show the
asymptotic normality of this part. It is not hard to see that

$$\sum_{t=1}^T \mathrm{var}(\xi_{Tt}) = 1.$$

Assumption 13 implies the required uniform integrability. Hence, by Theorems 24.3 and 24.4 of
Davidson (1994, Ch. 24), we obtain $\sum_{t=1}^T \xi_{Tt} \to_d N(0, 1)$. Because the last term of (24) is
$o_p(1)$ by the argument above, the result follows from the Slutsky lemma and Assumption
10. □


A.6 Lemmas for Proposition 1 

Recall that Cxu = limsupT- < oo. 

Lemma 4 Under Assumption 6, we have for any $i$ and $a > 0$,

$$P\Big(\max_t |x_{ti}u_t| > a\Big) \le 4T\exp\{-a/(2c_{xu})\}.$$

Proof We see that

$$P(|x_{ti}u_t| > a) \le P(|x_{ti}| > a^{1/2}) + P(|u_t| > a^{1/2}).$$

We consider the first term. By the construction of $x_{ti}$, suppressing the superscript, we
have

\begin{align*}
P\Big(\Big|\sum_{s=1}^T\sum_{k=1}^p r_{ts}\sigma_{ki}z_{sk}\Big| > a^{1/2}\Big) &= P\Big(\Big|(R_{tt}\Sigma_{ii})^{-1/2}\sum_{s=1}^T\sum_{k=1}^p r_{ts}\sigma_{ki}z_{sk}\Big| > \big(a/(R_{tt}\Sigma_{ii})\big)^{1/2}\Big) \\
&\le 2\exp\{-a/(2R_{tt}\Sigma_{ii})\} \le 2\exp\{-a/(2c_{xu})\},
\end{align*}

where the first inequality holds since $(R_{tt}\Sigma_{ii})^{-1/2}\sum_{s=1}^T\sum_{k=1}^p r_{ts}\sigma_{ki}z_{sk}$ is a standard normal
random variable and the last inequality follows from Assumption 6. It is clear that we
obtain the same result for $u_t$. Therefore, by the union bound, we have

$$P\Big(\max_t |x_{ti}u_t| > a\Big) \le 4T\exp\{-a/(2c_{xu})\},$$

which yields the desired inequality. □
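The two ingredients of Lemma 4 can be checked numerically. The sketch below compares the exact two-sided standard normal tail with the bound $2\exp(-x^2/2)$ used in the proof, and evaluates the final union bound $4T\exp\{-a/(2c_{xu})\}$. The function names and sample values are illustrative assumptions, not part of the paper.

```python
import math


def two_sided_normal_tail(x):
    """Exact P(|Z| > x) for a standard normal Z and x >= 0."""
    return math.erfc(x / math.sqrt(2.0))


def gaussian_tail_bound(x):
    """The sub-Gaussian bound 2 exp(-x^2 / 2) used in the proof."""
    return 2.0 * math.exp(-x * x / 2.0)


def lemma4_bound(a, T, c_xu):
    """Union bound over t = 1..T and over the two factors x_ti and u_t."""
    return 4.0 * T * math.exp(-a / (2.0 * c_xu))


# The exact tail never exceeds the bound on this grid.
for x in (0.0, 0.5, 1.0, 2.0, 4.0):
    print(x, two_sided_normal_tail(x), gaussian_tail_bound(x))
```

Setting $x = (a/(R_{tt}\Sigma_{ii}))^{1/2}$ in `gaussian_tail_bound` reproduces the term $2\exp\{-a/(2R_{tt}\Sigma_{ii})\}$ from the proof.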
Lemma 5 Let $\lambda = c_0\log(pT)(\log p/T)^{1/2}$ for any positive constant $c_0$. Let $m_0$ be an
arbitrary positive constant. Under Assumption 6, we have for any $i$,

$$P\Big(T^{-1}|x_i^\top u| > \lambda/2 \,\Big|\, \max_t |x_{ti}u_t| \le m_0\log(pT)\Big) \le 2p^{-c_0^2/(8m_0^2)}.$$


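Proof Because $(x_{ti}u_t, \mathcal{F}_t)$ is a martingale difference sequence with respect to $\mathcal{F}_t = \{u_{t-j}, x_{t-j+1} :
j = 0, 1, \ldots\}$ for each $i$, the Azuma-Hoeffding inequality yields

\begin{align*}
P\Big(T^{-1}|x_i^\top u| > \lambda/2 \,\Big|\, \max_t |x_{ti}u_t| \le a\Big) &= P\Big(\Big|\sum_{t=1}^T x_{ti}u_t\Big| > T\lambda/2 \,\Big|\, \max_t |x_{ti}u_t| \le a\Big) \\
&\le 2\exp\Big\{-\frac{(T\lambda/2)^2}{2Ta^2}\Big\}
\end{align*}

for any $a > 0$ for each $T$ and $p$. Plugging $\lambda$ and $a = m_0\log(pT)$ into the upper bound, we
have

$$2\exp\Big\{-\frac{Tc_0^2(\log pT)^2\log p}{8Tm_0^2(\log pT)^2}\Big\} = 2\exp\Big\{-\frac{c_0^2\log p}{8m_0^2}\Big\} = 2p^{-c_0^2/(8m_0^2)},$$

giving the result. □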
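The algebra in the last display of the proof of Lemma 5 can be verified numerically: plugging $\lambda = c_0\log(pT)(\log p/T)^{1/2}$ and $a = m_0\log(pT)$ into the Azuma-Hoeffding bound should give exactly $2p^{-c_0^2/(8m_0^2)}$. The sketch below does this for one arbitrary, purely illustrative choice of $p$, $T$, $c_0$ and $m_0$; the function names are hypothetical.

```python
import math


def azuma_bound(T, lam, a):
    """Azuma-Hoeffding bound 2 exp{-(T*lam/2)^2 / (2*T*a^2)}."""
    return 2.0 * math.exp(-((T * lam / 2.0) ** 2) / (2.0 * T * a * a))


def lemma5_rate(p, c0, m0):
    """Closed form 2 p^{-c0^2 / (8 m0^2)} stated in Lemma 5."""
    return 2.0 * p ** (-(c0 ** 2) / (8.0 * m0 ** 2))


# Illustrative values only.
p, T, c0, m0 = 500, 300, 4.0, 1.0
lam = c0 * math.log(p * T) * math.sqrt(math.log(p) / T)
a = m0 * math.log(p * T)

print(azuma_bound(T, lam, a), lemma5_rate(p, c0, m0))
```

The two printed numbers agree up to floating-point error because the factors $T$ and $(\log pT)^2$ cancel in the exponent.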


A.7 Proofs of Propositions 1 and 2 

Proof of Proposition 1 By the union bound and the property of the conditional probability,
we have for any $a > 0$,

\begin{align*}
P(\mathcal{E}_1^c) &= P\Big(T^{-1}\max_i |x_i^\top u| > \lambda/2\Big) \le \sum_{i=1}^p P\big(T^{-1}|x_i^\top u| > \lambda/2\big) \\
&\le \sum_{i=1}^p P\Big(T^{-1}|x_i^\top u| > \lambda/2 \,\Big|\, \max_t |x_{ti}u_t| \le a\Big) + \sum_{i=1}^p P\Big(\max_t |x_{ti}u_t| > a\Big).
\end{align*}

Let $a = m_0\log(pT)$ be the same as in the proof of Lemma 5. From Lemmas 4 and 5, this is
bounded as

$$P(\mathcal{E}_1^c) \le 2p^{1 - c_0^2/(8m_0^2)} + 4pT\exp\{-m_0\log(pT)/(2c_{xu})\}.$$

Since $m_0$ and $c_0$ are arbitrary, putting $m_0 = c_0/4$ and $c_0 \ge 16c_{xu}$ reduces this to

$$P(\mathcal{E}_1^c) \le 2p^{-1} + 4(pT)^{1 - c_0/(8c_{xu})} \le 2p^{-1} + 4(pT)^{-1} \le 6p^{-1},$$

giving the result. □
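The choice of constants at the end of the proof can be sanity-checked numerically. The sketch below evaluates the bound obtained from Lemmas 4 and 5 and confirms that, with $m_0 = c_0/4$ and $c_0 \ge 16c_{xu}$, it stays below $6p^{-1}$. The value $c_{xu} = 1$ and the sizes $p$, $T$ are assumptions chosen only for illustration.

```python
import math


def prop1_bound(p, T, c0, m0, c_xu):
    """Bound on P(E_1^c) obtained by combining Lemmas 4 and 5."""
    first = 2.0 * p ** (1.0 - c0 ** 2 / (8.0 * m0 ** 2))  # p copies of the Lemma 5 rate
    second = 4.0 * p * T * math.exp(-m0 * math.log(p * T) / (2.0 * c_xu))  # p copies of Lemma 4
    return first + second


# Illustrative constants: c_xu = 1, c0 = 16 * c_xu, m0 = c0 / 4.
p, T, c_xu = 500, 300, 1.0
c0 = 16.0 * c_xu
m0 = c0 / 4.0
bound = prop1_bound(p, T, c0, m0, c_xu)
print(bound, 6.0 / p)
```

With these constants the first term equals $2p^{-1}$ and the second equals $4(pT)^{-1}$, matching the last display of the proof.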


Proof of Proposition 2 We have 


m < < T, 


(25) 


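where the last strict inequality holds for large $T$. Note that each row of $W$ is viewed
as a $p$-dimensional random vector independently sampled from $N(0, \Sigma_x)$. Let $\tilde d = d_{\mathrm{supp}(d)}$,
$\tilde X = X_{\mathrm{supp}(d)}$ and $\tilde W = W_{\mathrm{supp}(d)}$. For any $\mathrm{supp}(d) \subset \{1, \ldots, p\}$ satisfying (25) and $\|d\|_0 \le m$,
we see that

\begin{align}
T^{-1}\|Xd\|_2^2/\|d\|_2^2 &= T^{-1}\Big(\frac{\tilde d^\top\tilde W^\top R_x\tilde W\tilde d}{\tilde d^\top\tilde W^\top\tilde W\tilde d}\Big)\Big(\frac{\tilde d^\top\tilde W^\top\tilde W\tilde d}{\tilde d^\top\tilde d}\Big) \notag\\
&\ge T^{-1}\min_{h \in \mathbb{R}^T}\Big(\frac{h^\top R_xh}{h^\top h}\Big)\min_{\tilde d \in \mathbb{R}^m}\Big(\frac{\tilde d^\top\tilde W^\top\tilde W\tilde d}{\tilde d^\top\tilde d}\Big) \notag\\
&\ge c_R\min_{\tilde d \in \mathbb{R}^m} T^{-1}\|\tilde W\tilde d\|_2^2/\|\tilde d\|_2^2, \tag{26}
\end{align}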


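where the last inequalities hold by Assumption 6. We denote by $\tilde w_t$ and $\tilde c$ the $t$th row of $\tilde W$
and the minimum eigenvalue of the covariance matrix of $\tilde w_t$, respectively. Inequality (26)
with the fact that $\tilde c \ge c_\Sigma$ leads to

\begin{align}
P\Big(\min_{\tilde d \in \mathbb{R}^m} T^{-1}\|\tilde X\tilde d\|_2^2/\|\tilde d\|_2^2 < c_\Sigma c_R/9\Big) &\le P\Big(c_R\min_{\tilde d \in \mathbb{R}^m} T^{-1}\|\tilde W\tilde d\|_2^2/\|\tilde d\|_2^2 < c_\Sigma c_R/9\Big) \notag\\
&\le P\Big(\min_{\tilde d \in \mathbb{R}^m} T^{-1}\|\tilde W\tilde d\|_2^2/\|\tilde d\|_2^2 < \tilde c/9\Big). \tag{27}
\end{align}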


An application of Lemma 9 in Wainwright (2009) gives

$$P\Big(\min_{\tilde d \in \mathbb{R}^m} T^{-1}\|\tilde W\tilde d\|_2^2/\|\tilde d\|_2^2 < \tilde c/9\Big) \le 2\exp(-T/2). \tag{28}$$

Finally, we extend the result uniformly in terms of the choice of $\mathrm{supp}(d)$. We see that
$\binom{p}{m} \le p^m \le \exp(\delta T)$ holds for large $T$ by Stirling's approximation and (25) with
Assumption 5. Therefore, taking the union bound and combining (27) and (28) gives

$$P\Big(\min_{d \in \mathbb{R}^p,\ \|d\|_0 \le m} T^{-1}\|Xd\|_2^2/\|d\|_2^2 < c_\Sigma c_R/9\Big) \le 2\exp(\delta T - T/2),$$

which goes to zero since $\delta < 1/2$ by Assumption 5. Consequently, if we choose $\gamma = c_\Sigma c_R/9$
and $c_2 = 1/2 - \delta$, we achieve the result. □
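The union-bound step at the end of this proof multiplies the per-support probability $2\exp(-T/2)$ by the number of candidate supports, which is at most $p^m$. The sketch below works on the log scale to avoid overflow and checks, for illustrative sizes only, that $\binom{p}{m} \le p^m$ and that the resulting bound is negative in logs (so the probability bound is below one). The function name and the values of $p$, $m$, $T$ are assumptions.

```python
import math


def log_union_bound(p, m, T):
    """Log of C(p, m) * 2 * exp(-T / 2), the union bound over supports of size m."""
    return math.log(math.comb(p, m)) + math.log(2.0) - T / 2.0


# Illustrative sizes only.
p, m, T = 500, 5, 300
log_bound = log_union_bound(p, m, T)
crude = m * math.log(p) + math.log(2.0) - T / 2.0  # uses C(p, m) <= p^m

print(log_bound, crude)
```

The bound is informative only when $m\log p$ grows more slowly than $T/2$, which is the role played by (25) together with Assumption 5 in the proof.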


A.8 Collinearity 


We explore how collinearity between $X_B$ and $X_A$ affects the oracle property obtained by
Theorem 2. Assumption 12 controls how much collinearity is allowed. Recall that $H_{BAT} =
T^{-1}X_B^\top X_A$ for $X_A \in \mathbb{R}^{T \times s}$ and $X_B \in \mathbb{R}^{T \times (p-s)}$. We are interested in the behavior of

$$\|H_{BAT}\|_{2,\infty} = \max_{\|v\|_2 = 1}\|H_{BAT}v\|_\infty = \max_{b \in B}\max_{\|v\|_2 = 1}|T^{-1}x_b^\top X_Av|,$$

where we write $X_Av = \sum_{a \in A} v_ax_a$. This value is expected to become unbounded (and hence
Assumption 12 is violated) under strong collinearity.

To obtain interpretable results, we make the following simplifying assumptions: the
regressors are deterministic, and for any $b \in B$ and $a \in A$, $T^{-1}x_b^\top x_a \to \rho_{ba} \ge 0$. Moreover,
we assume either of the two conditions:


1. $\max_{b \in B}\rho_{ba} \ge c > 0$ for all $a \in A$,

2. maxi,gg Pba < for some q > 1. 

Condition 1 describes a highly correlated case: the correlation between $x_b$ and $x_a$ persists
even as $s$ increases. On the other hand, condition 2 models weaker correlations than
condition 1 does. Specifically, most of the correlations become small as $q$ becomes large,
meaning that the effect of collinearity is limited in this case. In fact, it is not difficult to see
that $\|H_{BAT}\|_{2,\infty}$ diverges at least as fast as $s^{1/2}$ under condition 1 while $\|H_{BAT}\|_{2,\infty}$ is uniformly
bounded under condition 2. First, we suppose condition 1 and let $v = (s^{-1/2}, \ldots, s^{-1/2})^\top$. We
then observe that

$$\max_{b \in B}\max_{\|v\|_2 = 1}|T^{-1}x_b^\top X_Av| \ge \max_{b \in B}|T^{-1}x_b^\top X_Av| = \max_{b \in B}\Big|s^{-1/2}\sum_{a \in A}T^{-1}x_b^\top x_a\Big|.$$

By condition 1, the last term is bounded from below by

$$s^{-1/2}\max_{b \in B}\sum_{a \in A}(\rho_{ba} + o(1)) \ge s^{-1/2}\sum_{a \in A}(c - o(1)),$$

which goes to infinity as $s \to \infty$. Next, we suppose condition 2. By the Cauchy-Schwarz
inequality, we observe that

\begin{align*}
\max_{b \in B}\max_{\|v\|_2 = 1}|T^{-1}x_b^\top X_Av| &= \max_{b \in B}\max_{\|v\|_2 = 1}\Big|\sum_{a \in A}v_aT^{-1}x_b^\top x_a\Big| = \max_{b \in B}\max_{\|v\|_2 = 1}\Big|\sum_{a \in A}v_a(\rho_{ba} + o(1))\Big| \\
&\le \max_{b \in B}\Big(\sum_{a \in A}v_a^2\Big)^{1/2}\Big(\sum_{a \in A}(\rho_{ba} + o(1))^2\Big)^{1/2}.
\end{align*}

The last term converges since $q > 1$ under condition 2.
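The contrast between the two regimes can be illustrated for a single fixed $b$, ignoring the $o(1)$ terms. For a limiting correlation row $(\rho_{ba})_{a \in A}$, the maximum of $|\sum_a v_a\rho_{ba}|$ over unit vectors $v$ equals the $\ell_2$ norm of the row. The sketch below compares a row with all correlations equal to $c$ (an instance of condition 1) with a polynomially decaying row $\rho_{ba} = c\,a^{-q}$. The decaying form is a hypothetical stand-in chosen for illustration and is not claimed to be the exact form of condition 2.

```python
import math


def sup_over_unit_v(rho_row):
    """max over ||v||_2 = 1 of |sum_a v_a * rho_a|, i.e. the l2 norm of rho_row."""
    return math.sqrt(sum(r * r for r in rho_row))


def strong(s, c=0.5):
    """All s limiting correlations equal c (an instance of condition 1)."""
    return sup_over_unit_v([c] * s)


def decaying(s, c=0.5, q=4):
    """Hypothetical decaying correlations rho_a = c * a**(-q), a = 1..s."""
    return sup_over_unit_v([c * a ** (-q) for a in range(1, s + 1)])


for s in (25, 100, 400):
    print(s, strong(s), decaying(s))
```

In the first case the value is $c\,s^{1/2}$ and grows without bound, while in the second it stays close to $c$ for every $s$.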

The following simulation shows that strong collinearity (condition 1) affects the oracle
property. Table 7 shows the relative finite-sample success rates of detecting non-zero
coefficients (SC-A) and zero coefficients (SC-B), defined as SC-A $= P(\mathrm{sgn}(\hat\beta_A) = \mathrm{sgn}(\beta_{0A}))$
and SC-B $= P(\mathrm{sgn}(\hat\beta_B) = \mathrm{sgn}(\beta_{0B}))$ respectively, and the (average) mean squared error for
estimates of non-zero coefficients (MSE($\hat\beta_A$)) under Condition 1 relative to that under Condition
2 when T = 300,500,1000 and c = 0.5,0.98 with q = 4, p = 1.5 exp(r°-^') and s = 20T^-^. 
A value of 1 in the table therefore means that the finite-sample properties of the estimators
under Condition 1 are equivalent to those under Condition 2. We can confirm from Table 7 that
(i) the success rates are relatively low under Condition 1 irrespective of the degree of
collinearity ($c$) and (ii) the MSE under Condition 1 is expected to be much worse than that under
Condition 2 asymptotically, especially when the degree of collinearity is high. These facts are
consistent with the theoretical results because Condition 1 violates Assumption 12, so that
the oracle property no longer holds under Condition 1.

Table 7: Relative SC-A, SC-B and MSE (cond. 1/cond. 2)

                    c = 0.5                 c = 0.98
             SC-A    SC-B    MSE      SC-A    SC-B    MSE
T = 300      0.89    0.98    1.10     0.96    1.01    0.99
T = 500      0.88    0.99    1.19     0.96    1.00    0.98
T = 1000     1.00    1.00    1.25     0.95    1.00    3.49
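For concreteness, a sign-recovery success rate of the kind reported in Table 7 could be estimated from Monte Carlo replications as sketched below. This is only an illustration of the definition SC-A $= P(\mathrm{sgn}(\hat\beta_A) = \mathrm{sgn}(\beta_{0A}))$. The helper names and the toy replications are assumptions, and the paper's actual simulation code is not reproduced here.

```python
def sign(x):
    """Return -1, 0 or 1 according to the sign of x."""
    return (x > 0) - (x < 0)


def sign_match(estimate, truth):
    """True when every coordinate of the estimate has the sign of the truth."""
    return all(sign(e) == sign(t) for e, t in zip(estimate, truth))


def success_rate(estimates, truth):
    """Share of replications whose whole sign vector is recovered."""
    hits = sum(sign_match(est, truth) for est in estimates)
    return hits / len(estimates)


# Toy replications (made-up numbers, for illustration only).
truth_A = [1.0, -2.0, 0.5]
draws_A = [
    [0.9, -1.8, 0.4],
    [1.1, -2.2, 0.0],  # third coefficient wrongly set to zero
    [0.8, -1.9, 0.6],
    [1.0, -2.1, 0.3],
]
sc_a = success_rate(draws_A, truth_A)
print(sc_a)
```

A relative entry of the table would then be the ratio of two such rates, one computed under Condition 1 and one under Condition 2.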


A.9 Related works 

Wang et al. (2007) investigated the asymptotic properties of the Lasso and modified Lasso
(Lasso*) for linear regression with an autoregressive error model. They derived
model selection consistency and showed that the Lasso* can be the oracle estimator. Nardi
and Rinaldo (2011) considered the estimation and variable (lag) selection of autoregressive
models via the Lasso. They mainly focused on the lag selection of the AR parameters.
Lasso-type estimation of VAR models has been studied by several authors, including
Song and Bickel (2011), Nicolson et al. (2015), Basu and Michailidis (2015), and Kock
and Callot (2015). Theoretically, the latter two papers make significant contributions to the
high-dimensional time series literature, but their settings are different from ours. The results
obtained here are new and complement their work. Basu and Michailidis (2015)
investigated estimation of general high-dimensional time-dependent models via spectral
densities of covariates and errors, and derived a non-asymptotic error bound. Kock and Callot
(2015) derived a non-asymptotic error bound for a high-dimensional VAR model.